The Annals of Statistics 

2004, Vol. 32, No. 4, 1367-1433 

DOI: 10.1214/009053604000000553 

© Institute of Mathematical Statistics, 2004 

GAME THEORY, MAXIMUM ENTROPY, MINIMUM 
DISCREPANCY AND ROBUST BAYESIAN DECISION THEORY

By Peter D. Grünwald and A. Philip Dawid
CWI Amsterdam and University College London 

We describe and develop a close relationship between two prob- 
lems that have customarily been regarded as distinct: that of max- 
imizing entropy, and that of minimizing worst-case expected loss. 
Using a formulation grounded in the equilibrium theory of zero-sum 
games between Decision Maker and Nature, these two problems are 
shown to be dual to each other, the solution to each providing that to 
the other. Although Topsøe described this connection for the Shannon
entropy over 20 years ago, it does not appear to be widely known
even in that important special case.

We here generalize this theory to apply to arbitrary decision prob- 
lems and loss functions. We indicate how an appropriate generalized 
definition of entropy can be associated with such a problem, and 
we show that, subject to certain regularity conditions, the above- 
mentioned duality continues to apply in this extended context. This 
simultaneously provides a possible rationale for maximizing entropy 
and a tool for finding robust Bayes acts. We also describe the essen- 
tial identity between the problem of maximizing entropy and that 
of minimizing a related discrepancy or divergence between distribu- 
tions. This leads to an extension, to arbitrary discrepancies, of a well- 
known minimax theorem for the case of Kullback-Leibler divergence 
(the "redundancy-capacity theorem" of information theory).

For the important case of families of distributions having certain
mean values specified, we develop simple sufficient conditions and 
methods for identifying the desired solutions. We use this theory to 
introduce a new concept of "generalized exponential family" linked 
to the specific decision problem under consideration, and we demon- 
strate that this shares many of the properties of standard exponen- 
tial families. 

Finally, we show that the existence of an equilibrium in our game 
can be rephrased in terms of a "Pythagorean property" of the re- 
lated divergence, thus generalizing previously announced results for 
Kullback-Leibler and Bregman divergences. 

1. Introduction. Suppose that, for purposes of inductive inference or
choosing an optimal decision, we wish to select a single distribution $P^*$
to act as representative of a class $\Gamma$ of such distributions. The maximum
entropy principle [Jaynes (1989), Csiszár (1991) and Kapur and Kesavan
(1992)] is widely applied for this purpose, but its rationale has often been
controversial [see, e.g., van Fraassen (1981), Shimony (1985), Skyrms (1985),
Jaynes (1985), Seidenfeld (1986) and Uffink (1995, 1996)]. Here we emphasize
and generalize a reinterpretation of the maximum entropy principle
[Topsøe (1979), Walley (1991), Chapter 5, Section 12, and Grünwald (1998)]:
that the distribution $P^*$ that maximizes the entropy over $\Gamma$ also minimizes
the worst-case expected logarithmic score (log loss). In the terminology of
decision theory [Berger (1985)], $P^*$ is a robust Bayes, or $\Gamma$-minimax, act, when
loss is measured by the logarithmic score. This gives a decision-theoretic
interpretation of maximum entropy.

In this paper we extend this result to apply to a generalized concept of
entropy, tailored to whatever loss function $L$ is regarded as appropriate,
not just logarithmic score. We show that, under regularity conditions,
maximizing this generalized entropy constitutes the major step toward finding
the robust Bayes ("$\Gamma$-minimax") act against $\Gamma$ with respect to $L$. For the
important special case that $\Gamma$ is described by mean-value constraints, we
give theorems that in many cases allow us to find the maximum generalized
entropy distribution explicitly. We further define generalized exponential
families of distributions, which, for the case of the logarithmic score,
reduce to the usual exponential families. We extend generalized entropy to
generalized relative entropy and show how this is essentially the same as a
general decision-theoretic definition of discrepancy. We show that the family
of divergences between probability measures known as Bregman divergences
constitutes a special case of such discrepancies. A discrepancy can also be
used as a loss function in its own right: we show that a minimax result for
relative entropy [Haussler (1997)] can be extended to this more general case.
We further show that a "Pythagorean property" [Csiszár (1991)] known to
hold for relative entropy and for Bregman divergences in fact applies much
more generally; and we give a precise characterization of those discrepancies
for which it holds.

Our analysis is game-theoretic, a crucial concern being the existence and 
properties of a saddle-point, and its associated minimax and maximin acts, 
in a suitable zero-sum game between Decision Maker and Nature. 

1.1. A word of caution. It is not our purpose either to advocate or to 
criticize the maximum entropy or robust Bayes approach: we adopt a philo- 
sophically neutral stance. Rather, our aim is mathematical unification. By 
generalizing the concept of entropy beyond the standard Shannon frame- 
work, we obtain a variety of interesting characterizations of maximum gen- 
eralized entropy and display its connections with other known concepts and 
results. 

The connection with $\Gamma$-minimax might be viewed, by those who already
regard robust Bayes as a well-founded principle, as a justification for
maximizing entropy; but it should be noted that $\Gamma$-minimax, like all minimax
approaches, is not without problems of its own [Berger (1985)]. We must also
point out that some of the more problematic aspects of maximum entropy 
inference, such as the incompatibility of maximum entropy with Bayesian 
updating [Seidenfeld (1986) and Uffink (1996)], carry over to our general- 
ized setting: in the words of one referee, rather than resolving this problem, 
we "spread it to a new level of abstraction and generality." Although these 
dangers must be firmly held in mind when considering the implications of 
this work for inductive inference, they do not undermine the mathematical 
connections established. 

2. Overview. We start with an overview of our results. For ease of ex- 
position, we make several simplifying assumptions, such as a finite sample 
space, in this section. These assumptions will later be relaxed. 

2.1. Maximum entropy and game theory. Let $\mathcal{X}$ be a finite sample space,
and let $\Gamma$ be a family of distributions over $\mathcal{X}$. Consider a Decision Maker
(DM) who has to make a decision whose consequences will depend on the
outcome of a random variable $X$ defined on $\mathcal{X}$. DM is willing to assume
that $X$ is distributed according to some $P \in \Gamma$, a known family of
distributions over $\mathcal{X}$, but he or she does not know which such distribution
applies. DM would like to pick a single $P^* \in \Gamma$ to base decisions on. One
way of selecting such a $P^*$ is to apply the maximum entropy principle
[Jaynes (1989)], which advises DM to pick that distribution $P^* \in \Gamma$
maximizing $H(P)$ over all $P \in \Gamma$. Here $H(P)$ denotes the Shannon entropy of $P$,
$H(P) := -\sum_{x \in \mathcal{X}} p(x) \log p(x) = E_P\{-\log p(X)\}$, where $p$ is the probability
mass function of $P$. However, the various rationales offered in support of
this advice have often been unclear or disputed. Here we shall present a
game-theoretic rationale, which some may find attractive.

Let $\mathcal{A}$ be the set of all probability mass functions defined over $\mathcal{X}$. By the
information inequality [Cover and Thomas (1991)], we have that, for any
distribution $P$, $\inf_{q \in \mathcal{A}} E_P\{-\log q(X)\}$ is achieved uniquely at $q = p$, where
it takes the value $H(P)$. That is, $H(P) = \inf_{q \in \mathcal{A}} E_P\{-\log q(X)\}$, and so the
maximum entropy can be written as

(1)    $\sup_{P \in \Gamma} H(P) = \sup_{P \in \Gamma} \inf_{q \in \mathcal{A}} E_P\{-\log q(X)\}.$
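The identity $H(P) = \inf_{q} E_P\{-\log q(X)\}$ can be checked numerically on a small sample space. The following is a minimal sketch, assuming a three-point sample space, natural logarithms and a handful of hand-picked candidate acts; the names `expected_log_loss`, `shannon_entropy` and `candidates` are illustrative and do not come from the paper.

```python
import math

def expected_log_loss(p, q):
    """E_P{-log q(X)} for pmfs p and q on a finite space (natural log)."""
    total = 0.0
    for px, qx in zip(p, q):
        if px > 0:
            if qx == 0:
                return math.inf  # q rules out an outcome that P can produce
            total -= px * math.log(qx)
    return total

def shannon_entropy(p):
    """H(P) = E_P{-log p(X)}."""
    return expected_log_loss(p, p)

p = [0.5, 0.3, 0.2]              # hypothetical distribution P
candidates = [                   # hypothetical acts q available to DM
    [0.5, 0.3, 0.2],             # q = p
    [1 / 3, 1 / 3, 1 / 3],
    [0.6, 0.2, 0.2],
    [0.1, 0.1, 0.8],
]
losses = [expected_log_loss(p, q) for q in candidates]
best = min(range(len(candidates)), key=losses.__getitem__)
```

Among the listed candidates the smallest expected loss is attained at $q = p$, where it equals the entropy; a finite list of candidates illustrates the information inequality but of course does not prove it.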

Now consider the "log loss game" [Good (1952)], in which DM has to
specify some $q \in \mathcal{A}$, and DM's ensuing loss if Nature then reveals $X = x$
is measured by $-\log q(x)$. Alternatively, we can consider the "code-length
game" [Topsøe (1979) and Harremoës and Topsøe (2001)], wherein we
require DM to specify a prefix-free code $\sigma$, mapping $\mathcal{X}$ into a suitable set
of finite binary strings, and to measure his or her loss when $X = x$ by the
length $\kappa(x)$ of the codeword $\sigma(x)$. Thus DM's objective is to minimize
expected code-length. Basic results of coding theory [see, e.g., Dawid (1992)]
imply that we can associate with $\sigma$ a probability mass function $q$ having
$q(x) = 2^{-\kappa(x)}$. Then, up to a constant, $-\log q(x)$ becomes identical with the
code-length $\kappa(x)$, so that the log loss game is essentially equivalent to the
code-length game.
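The correspondence between code lengths and probability mass functions can be made concrete. The sketch below assumes a hypothetical four-symbol prefix-free code (the dictionary `code` is invented for illustration) and uses base-2 logarithms, for which the constant relating log loss and code length is 1. The chosen code is complete, so the induced $q$ sums to exactly 1; for an incomplete prefix-free code the sum would be smaller than 1.

```python
import math

# Hypothetical prefix-free binary code on a four-point sample space.
code = {"a": "0", "b": "10", "c": "110", "d": "111"}

def is_prefix_free(words):
    """True if no codeword is a proper prefix of another codeword."""
    words = list(words)
    return not any(u != v and v.startswith(u) for u in words for v in words)

kappa = {x: len(word) for x, word in code.items()}     # code lengths kappa(x)
q = {x: 2.0 ** -k for x, k in kappa.items()}           # q(x) = 2^{-kappa(x)}
kraft_sum = sum(q.values())                            # at most 1 for a prefix-free code
log_loss_bits = {x: -math.log2(q[x]) for x in code}    # log loss measured in bits
```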

By analogy with minimax results of game theory, one might conjecture
that

(2)    $\sup_{P \in \Gamma} \inf_{q \in \mathcal{A}} E_P\{-\log q(X)\} = \inf_{q \in \mathcal{A}} \sup_{P \in \Gamma} E_P\{-\log q(X)\}.$

As we have seen, $P$ achieving the supremum on the left-hand side of (2) is a
maximum entropy distribution in $\Gamma$. However, just as important, $q$ achieving
the infimum on the right-hand side of (2) is a robust Bayes act against $\Gamma$,
or a $\Gamma$-minimax act [Berger (1985)], for the log loss decision problem.

Now it turns out that, when $\Gamma$ is closed and convex, (2) does indeed hold
under very general conditions. Moreover the infimum on the right-hand side
is achieved uniquely for $q = p^*$, the probability mass function of the
maximum entropy distribution $P^*$. Thus, in this game between DM and Nature,
the maximum entropy distribution $P^*$ may be viewed, simultaneously, as
defining both Nature's maximin and (in our view more interesting) DM's
minimax strategy. In other words, maximum entropy is robust Bayes. This
decision-theoretic reinterpretation might now be regarded as a plausible
justification for selecting the maximum entropy distribution. Note particularly
that we do not restrict the acts $q$ available to DM to those corresponding
to a distribution in the restricted set $\Gamma$: that the optimal act $p^*$ does indeed
turn out to have this property is a consequence of, not a restriction on, the
analysis.

The maximum entropy method has been most commonly applied in the
setting where $\Gamma$ is described by mean-value constraints [Jaynes (1989) and
Csiszár (1991)]: $\Gamma = \{P : E_P(T) = \tau\}$, where $T = t(X) \in \mathbb{R}^k$ is some given
real- or vector-valued statistic. As pointed out by Grünwald (1998), for
such constraints the property (2) is particularly easy to show. By the
general theory of exponential families [Barndorff-Nielsen (1978)], under some
mild conditions on $\tau$ there will exist a distribution $P^*$ satisfying the
constraint $E_{P^*}(T) = \tau$ and having probability mass function of the form
$p^*(x) = \exp\{\alpha_0 + \alpha^\top t(x)\}$ for some $\alpha \in \mathbb{R}^k$, $\alpha_0 \in \mathbb{R}$. Then, for any $P \in \Gamma$,

(3)    $E_P\{-\log p^*(X)\} = -\alpha_0 - \alpha^\top E_P(T) = -\alpha_0 - \alpha^\top \tau = H(P^*).$

We thus see that $p^*$ is an "equalizer rule" against $\Gamma$, having the same
expected loss under any $P \in \Gamma$.

To see that $P^*$ maximizes entropy observe that, for any $P \in \Gamma$,

(4)    $H(P) = \inf_{q \in \mathcal{A}} E_P\{-\log q(X)\} \le E_P\{-\log p^*(X)\} = H(P^*),$

by (3).

To see that $p^*$ is robust Bayes and that (2) holds, note that, for any $q \in \mathcal{A}$,

(5)    $\sup_{P \in \Gamma} E_P\{-\log q(X)\} \ge E_{P^*}\{-\log q(X)\} \ge E_{P^*}\{-\log p^*(X)\} = H(P^*),$

where the second inequality is the information inequality [Cover and Thomas
(1991)]. Hence

(6)    $H(P^*) \le \inf_{q \in \mathcal{A}} \sup_{P \in \Gamma} E_P\{-\log q(X)\}.$

However, it follows trivially from the "equalizer" property (3) of $p^*$ that

(7)    $\sup_{P \in \Gamma} E_P\{-\log p^*(X)\} = H(P^*).$

From (6) and (7), we see that the choice $q = p^*$ achieves the infimum on the
right-hand side of (2) and is thus robust Bayes. Moreover, (2) holds, with
both sides equal to $H(P^*)$.
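The equalizer argument can be illustrated numerically. The following is a sketch under stated assumptions: the sample space is a six-sided die, the statistic is $t(x) = x$, the constraint value $\tau = 4.5$ is chosen arbitrarily, and the exponential-family parameter is found by bisection (a numerical convenience, not a method taken from the paper). The three members of $\Gamma$ listed in `gamma_members` are hand-picked distributions with mean 4.5.

```python
import math

xs = [1, 2, 3, 4, 5, 6]   # hypothetical sample space: faces of a die
tau = 4.5                 # hypothetical mean-value constraint E_P(X) = tau

def exp_family(alpha):
    """pmf proportional to exp(alpha * x); alpha_0 is minus the log normalizer."""
    weights = [math.exp(alpha * x) for x in xs]
    z = sum(weights)
    return [w / z for w in weights]

def mean(p):
    return sum(pi * x for pi, x in zip(p, xs))

def solve_alpha(target, lo=-10.0, hi=10.0, iters=200):
    """Bisection on alpha; the mean is increasing in alpha."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mean(exp_family(mid)) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def expected_log_loss(p, q):
    """E_P{-log q(X)}, assuming q(x) > 0 wherever p(x) > 0."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

p_star = exp_family(solve_alpha(tau))
h_star = expected_log_loss(p_star, p_star)            # H(P*)

gamma_members = [                                     # each has mean 4.5
    [0, 0, 0, 0.5, 0.5, 0],
    [0.3, 0, 0, 0, 0, 0.7],
    [0, 0, 0.25, 0.25, 0.25, 0.25],
]
equalizer_losses = [expected_log_loss(p, p_star) for p in gamma_members]  # as in (3)
entropies = [expected_log_loss(p, p) for p in gamma_members]              # as in (4)

# A competing act (the uniform pmf) has a larger worst-case loss, as in (5).
uniform = [1 / 6] * 6
uniform_worst = max(expected_log_loss(p, uniform) for p in gamma_members + [p_star])
```

Each entry of `equalizer_losses` agrees with `h_star` up to rounding, no listed member of $\Gamma$ has larger entropy, and the uniform act does worse in the worst case.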

The above argument can be extended to much more general sample spaces
(see Section 7). Although this game-theoretic approach and result date back
at least to Topsøe (1979), they seem to have attracted little attention so far.

2.2. This work: generalized entropy. The above robust Bayes view of 
maximum entropy might be regarded as justifying its use in those decision 
problems, such as discrete coding and Kelly gambling [Cover and Thomas 
(1991)], where the log loss is clearly an appropriate loss function to use. 
But what if we are interested in other loss functions? This is the principal 
question we address in this paper.

2.2.1. Generalized entropy and robust Bayes acts. We first recall, in
Section 3, a natural generalization of the concept of "entropy" (or "uncertainty
inherent in a distribution"), related to a specific decision problem and loss
function facing DM. The generalized entropy thus associated with the log
loss problem is just the Shannon entropy. More generally, let $\mathcal{A}$ be some
space of actions or decisions and let $\mathcal{X}$ be the (not necessarily finite) space
of possible outcomes to be observed. Let the loss function be given by
$L : \mathcal{X} \times \mathcal{A} \to (-\infty, \infty]$, and let $\Gamma$ be a convex set of distributions over $\mathcal{X}$.
In Sections 4-6 we set up a statistical game $G^\Gamma$ based on these ingredients
and use this to show that, under a variety of broad regularity conditions,
the distribution $P^*$ maximizing, over $\Gamma$, the generalized entropy associated
with the loss function $L$ has a Bayes act $a^* \in \mathcal{A}$ [achieving $\inf_{a \in \mathcal{A}} L(P^*, a)$]
that is a robust Bayes ($\Gamma$-minimax) decision relative to $L$, thus generalizing
the result for the log loss described in Section 2.1. Some variations on this
result are also given.

2.2.2. Generalized exponential families. In Section 7 we consider in
detail the case of mean-value constraints, of the form $\Gamma = \{P : E_P(T) = \tau\}$.
For fixed loss function $L$ and statistic $T$, as $\tau$ varies we obtain a family
of maximum generalized entropy distributions, one for each value of $\tau$. For
Shannon entropy, this turns out to coincide with the exponential family
having natural sufficient statistic $T$ [Csiszár (1975)]. In close analogy we define
the collection of maximum generalized entropy distributions, as we vary $\tau$,
to be the generalized exponential family determined by $L$ and $T$, and we
give several examples of such generalized exponential families. In particular,
Lafferty's "additive models based on Bregman divergences" [Lafferty (1999)]
are special cases of our generalized exponential families (Section 8.4.2).

2.2.3. Generalized relative entropy and discrepancy. In Section 8 we de- 
scribe how generalized entropy extends to generalized relative entropy and 
show how this in turn is intimately related to a discrepancy or divergence 
function. Maximum generalized relative entropy then becomes a special 
case of the minimum discrepancy method. For the log loss, the associated 
discrepancy function is just the familiar Kullback-Leibler divergence, and 
the method then coincides with the "classical" minimum relative entropy 
method [Jaynes (1989); note that, for Jaynes, "relative entropy" is the same 
as Kullback-Leibler divergence; for us it is the negative of this]. 

2.2.4. A generalized redundancy-capacity theorem. In many statistical
decision problems it is more natural to seek minimax decisions with
respect to the discrepancy associated with a loss, rather than with respect to
the loss directly. With any game we thus associate a new "derived game,"
in which the discrepancy constructed from the loss function of the original
game now serves as a new loss function. In Section 9 we show that our
minimax theorems apply to games of this form too: broadly, whenever the
conditions for such a theorem hold for the original game, they also hold for
the derived game. As a special case, we reprove a minimax theorem for the
Kullback-Leibler divergence [Haussler (1997)], known in information theory
as the redundancy-capacity theorem [Merhav and Feder (1995)].

2.2.5. The Pythagorean property. The Kullback-Leibler divergence has
a celebrated property reminiscent of squared Euclidean distance: it satisfies
an analogue of the Pythagorean theorem [Csiszár (1975)]. It has been noted
[Csiszár (1991), Jones and Byrne (1990) and Lafferty (1999)] that a version
of this property is shared by the broader class of Bregman divergences. In
Section 10 we show that a "Pythagorean inequality" in fact holds for the
discrepancy based on an arbitrary loss function $L$, so long as the game $G^\Gamma$ has
a value; that is, an analogue of (2) holds. Such decision-based discrepancies
include Bregman divergences as special cases. We demonstrate that, even
for the case of mean-value constraints, the Pythagorean inequality for a
Bregman divergence may be strict.

2.2.6. Finally, Section 11 takes stock of what has been achieved and 
presents some suggestions for further development. 

3. Decision problems. In this section we set out some general defini- 
tions and properties we shall require. For more background on the concepts 
discussed here, see Dawid (1998). 

A DM has to take some action $a$ selected from a given action space $\mathcal{A}$, after
which Nature will reveal the value $x \in \mathcal{X}$ of a quantity $X$, and DM will then
suffer a loss $L(x, a)$ in $(-\infty, \infty]$. We suppose that Nature takes no account
of the action chosen by DM. Then this can be considered as a zero-sum
game between Nature and DM, with both players moving simultaneously,
and DM paying Nature $L(x, a)$ after both moves are revealed. We call such
a combination $G := (\mathcal{X}, \mathcal{A}, L)$ a basic game.

Both DM and Nature are also allowed to make randomized moves, such a
move being described by a probability distribution $P$ over $\mathcal{X}$ (for Nature) or
$\zeta$ over $\mathcal{A}$ (for DM). We assume that suitable $\sigma$-fields, containing all singleton
sets, have been specified in $\mathcal{X}$ and $\mathcal{A}$, and that any probability distributions
considered are defined over the relevant $\sigma$-field; we denote the family of all
such probability distributions on $\mathcal{X}$ by $\mathcal{P}_0$. We further suppose that the loss
function $L$ is jointly measurable.

3.1. Expected loss. We shall permit algebraic operations on the extended
real line $[-\infty, \infty]$, with definitions and exceptions as in Rockafellar (1970),
Section 4.

For a function $f : \mathcal{X} \to [-\infty, \infty]$, and $P \in \mathcal{P}_0$, we may denote $E_P\{f(X)\}$
[i.e., $E_{X \sim P}\{f(X)\}$] by $f(P)$. When $f$ is bounded below, $f(P)$ is construed
as $\infty$ if $P\{f(X) = \infty\} > 0$. When $f$ is unbounded, we interpret $f(P)$ as
$f^+(P) - f^-(P) \in [-\infty, +\infty]$, where $f^+(x) := \max\{f(x), 0\}$ and
$f^-(x) := \max\{-f(x), 0\}$, allowing either $f^+(P)$ or $f^-(P)$ to take the value $\infty$,
but not both. If both are infinite, $f(P)$ is undefined; otherwise it is defined
(either as a finite number or as $\pm\infty$).

If DM knows that Nature is generating $X$ from $P$ or, in the absence of
such knowledge, DM is using $P$ to represent his or her own uncertainty
about $X$, then the undesirability to DM of any act $a \in \mathcal{A}$ will be assessed
by means of its expected loss,

(8)    $L(P, a) := E_P\{L(X, a)\}.$

We can similarly extend $L$ to randomized acts: $L(x, \zeta) := E_{A \sim \zeta}\{L(x, A)\}$,
$L(P, \zeta) = E_{(X, A) \sim P \times \zeta}\{L(X, A)\}$.

Throughout this paper we shall mostly confine attention to probability
measures $P \in \mathcal{P}_0$ such that $L(P, a)$ is defined for all $a \in \mathcal{A}$, and we shall
denote the family of all such $P$ by $\mathcal{P}$. We further confine attention to
randomized acts $\zeta$ such that $L(P, \zeta)$ is defined for all $P \in \mathcal{P}$, denoting the set
of all such $\zeta$ by $\mathcal{Z}$. Note that any distribution degenerate at a point $x \in \mathcal{X}$
is in $\mathcal{P}$, and so $L(x, \zeta)$ is defined for all $x \in \mathcal{X}$, $\zeta \in \mathcal{Z}$.



Lemma 3.1. For all $P \in \mathcal{P}$, $\zeta \in \mathcal{Z}$,

(9)    $L(P, \zeta) = E_{X \sim P}\{L(X, \zeta)\} = E_{A \sim \zeta}\{L(P, A)\}.$

Proof. When $L(P, \zeta)$ is finite this is just Fubini's theorem.

Now consider the case $L(P, \zeta) = \infty$. First suppose $L \ge 0$ everywhere.
If $L(x, \zeta) = \infty$ for $x$ in a subset of $\mathcal{X}$ having positive $P$-measure, then (9)
holds, both sides being $+\infty$. Otherwise, $L(x, \zeta)$ is finite almost surely [$P$].
If $E_P\{L(X, \zeta)\}$ were finite, then by Fubini it would be the same as $L(P, \zeta)$.
So once again $E_P\{L(X, \zeta)\} = L(P, \zeta) = +\infty$.

This result now extends easily to possibly negative $L$, on noting that
$L^-(P, \zeta)$ must be finite; a parallel result holds when $L(P, \zeta) = -\infty$.

Finally the whole argument can be repeated after interchanging the roles
of $x$ and $a$ and of $P$ and $\zeta$. □



Corollary 3.1. For any $P \in \mathcal{P}$,

(10)    $\inf_{\zeta \in \mathcal{Z}} L(P, \zeta) = \inf_{a \in \mathcal{A}} L(P, a).$

Proof. Clearly $\inf_{\zeta \in \mathcal{Z}} L(P, \zeta) \le \inf_{a \in \mathcal{A}} L(P, a)$. If $\inf_{a \in \mathcal{A}} L(P, a) = -\infty$
we are done. Otherwise, for any $\zeta \in \mathcal{Z}$, $L(P, \zeta) = E_{A \sim \zeta} L(P, A) \ge \inf_{a \in \mathcal{A}} L(P, a)$.
□

We shall need the fact that, for any $\zeta \in \mathcal{Z}$, $L(P, \zeta)$ is linear in $P$ in the
following sense.

Lemma 3.2. Let $P_0, P_1 \in \mathcal{P}$, and let $P_\lambda := (1 - \lambda) P_0 + \lambda P_1$. Fix $\zeta \in \mathcal{Z}$,
such that the pair $\{L(P_0, \zeta), L(P_1, \zeta)\}$ does not contain both the values
$-\infty$ and $+\infty$. Then, for any $\lambda \in (0, 1)$, $L(P_\lambda, \zeta)$ is finite if and only if both
$L(P_1, \zeta)$ and $L(P_0, \zeta)$ are. In this case $L(P_\lambda, \zeta) = (1 - \lambda) L(P_0, \zeta) + \lambda L(P_1, \zeta)$.

Proof. Consider a bivariate random variable $(I, X)$ with joint distribution
$P^*$ over $\{0, 1\} \times \mathcal{X}$ specified by the following: $I = 1, 0$ with respective
probabilities $\lambda$, $1 - \lambda$; and, given $I = i$, $X$ has distribution $P_i$. By Fubini we
have

$E_{P^*}\{L(X, \zeta)\} = E_{P^*}[E_{P^*}\{L(X, \zeta) \mid I\}],$

in the sense that, whenever one side of this equation is defined and finite,
the same holds for the other, and they are equal. Noting that, under $P^*$, the
distribution of $X$ is $P_\lambda$ marginally, and $P_i$ conditional on $I = i$ ($i = 0, 1$), the
result follows. □
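In the finite case, where every expectation is a finite sum, the statements of Lemma 3.1, Corollary 3.1 and Lemma 3.2 reduce to elementary identities that can be checked directly. The sketch below assumes a hypothetical 3-by-3 loss table `LOSS` and hand-picked distributions; none of the numbers come from the paper, and the infinite-valued cases treated in the proofs are not exercised.

```python
# Hypothetical finite loss table: LOSS[x][a] for 3 outcomes and 3 acts.
LOSS = [
    [0.0, 2.0, 5.0],
    [1.0, 0.0, 3.0],
    [4.0, 1.0, 0.0],
]
N_X, N_A = 3, 3

def loss_P_a(P, a):
    """L(P, a) = E_P{L(X, a)}."""
    return sum(P[x] * LOSS[x][a] for x in range(N_X))

def loss_x_zeta(x, zeta):
    """L(x, zeta) = E_{A ~ zeta}{L(x, A)}."""
    return sum(zeta[a] * LOSS[x][a] for a in range(N_A))

def loss_P_zeta(P, zeta):
    """L(P, zeta) = E_{(X, A) ~ P x zeta}{L(X, A)}."""
    return sum(P[x] * zeta[a] * LOSS[x][a]
               for x in range(N_X) for a in range(N_A))

P0 = [0.5, 0.3, 0.2]
P1 = [0.1, 0.1, 0.8]
zeta = [0.6, 0.4, 0.0]   # a randomized act for DM

# Lemma 3.1: both iterated expectations agree with the joint expectation.
joint = loss_P_zeta(P0, zeta)
via_x = sum(P0[x] * loss_x_zeta(x, zeta) for x in range(N_X))
via_a = sum(zeta[a] * loss_P_a(P0, a) for a in range(N_A))

# Corollary 3.1: randomizing cannot beat the best nonrandomized act.
best_pure = min(loss_P_a(P0, a) for a in range(N_A))

# Lemma 3.2: L(P_lambda, zeta) is linear in lambda.
lam = 0.25
P_lam = [(1 - lam) * u + lam * v for u, v in zip(P0, P1)]
mixed = loss_P_zeta(P_lam, zeta)
interpolated = (1 - lam) * loss_P_zeta(P0, zeta) + lam * loss_P_zeta(P1, zeta)
```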

3.2. Bayes act. Intuitively, when $X \sim P$ an act $a_P \in \mathcal{A}$ will be optimal
if it minimizes $L(P, a)$ over all $a \in \mathcal{A}$. Any such act $a_P$ is a Bayes act against
$P$. More generally, to allow for the possibility that $L(P, a)$ may be infinite as
well as to take into account randomization, we call $\zeta_P \in \mathcal{Z}$ a (randomized)
Bayes act, or simply Bayes, against $P$ (not necessarily in $\mathcal{P}$) if

(11)    $E_P\{L(X, \zeta) - L(X, \zeta_P)\} \in [0, \infty]$

for all $\zeta \in \mathcal{Z}$. We denote by $\mathcal{A}_P$ (resp. $\mathcal{Z}_P$) the set of all nonrandomized
(resp. randomized) Bayes acts against $P$. Clearly $\mathcal{A}_P \subseteq \mathcal{Z}_P$, and $L(P, \zeta_P)$ is
the same for all $\zeta_P \in \mathcal{Z}_P$.

The loss function $L$ will be called $\Gamma$-strict if, for each $P \in \Gamma$, there
exists $a_P \in \mathcal{A}$ that is the unique Bayes act against $P$; $L$ is $\Gamma$-semistrict if,
for each $P \in \Gamma$, $\mathcal{A}_P$ is nonempty, and $a, a' \in \mathcal{A}_P \Rightarrow L(\cdot, a) \equiv L(\cdot, a')$. When
$L$ is $\Gamma$-strict, and $P \in \Gamma$, it can never be optimal for DM to choose a
randomized act; when $L$ is $\Gamma$-semistrict, even though a randomized act can be
optimal there is never any point in choosing one, since its loss function will
be identical with that of any nonrandomized optimal act.

Semistrictness is clearly weaker than strictness. For our purposes we can
replace it by the still weaker concept of relative strictness: $L$ is $\Gamma$-relatively
strict if for all $P \in \Gamma$ the set of Bayes acts $\mathcal{A}_P$ is nonempty and, for all
$a, a' \in \mathcal{A}_P$, $L(P', a) = L(P', a')$ for all $P' \in \Gamma$.

3.3. Bayes loss and entropy. Whether or not a Bayes act exists, the
Bayes loss $H(P) \in [-\infty, \infty]$ of a distribution $P \in \mathcal{P}$ is defined by

(12)    $H(P) := \inf_{a \in \mathcal{A}} L(P, a).$

It follows from Corollary 3.1 that it would make no difference if the infimum
in (12) were extended to be over $\zeta \in \mathcal{Z}$. We shall mostly be interested in
Bayes acts of distributions $P$ with finite $H(P)$. In the context of Section 2.1,
with $L(x, q)$ the log loss $-\log q(x)$, $H(P)$ is just the Shannon entropy of $P$.

Proposition 3.1. Let $P \in \mathcal{P}$ and suppose $H(P)$ is finite. Then the
following hold:

(i) $\zeta_P \in \mathcal{Z}$ is Bayes against $P$ if and only if

(13)    $E_P\{L(X, a) - L(X, \zeta_P)\} \in [0, \infty]$

for all $a \in \mathcal{A}$.

(ii) $\zeta_P$ is Bayes against $P$ if and only if $L(P, \zeta_P) = H(P)$.

(iii) If $P$ admits some randomized Bayes act, then $P$ also admits some
nonrandomized Bayes act; that is, $\mathcal{A}_P$ is not empty.

Proof. Items (i) and (ii) follow easily from (10) and finiteness. To
prove (iii), let $f(P, a) := L(P, a) - H(P)$. Then $f(P, a) \ge 0$ for all $a$, while
$E_{A \sim \zeta_P} f(P, A) = L(P, \zeta_P) - H(P) = 0$. We deduce that $\{a \in \mathcal{A} : f(P, a) = 0\}$
has probability 1 under $\zeta_P$ and so, in particular, must be nonempty. □

We express the well-known concavity property of the Bayes loss [DeGroot
(1970), Section 8.4] as follows.

Proposition 3.2. Let $P_0, P_1 \in \mathcal{P}$, and let $P_\lambda := (1 - \lambda) P_0 + \lambda P_1$.
Suppose that $H(P_i) < \infty$ for $i = 0, 1$. Then $H(P_\lambda)$ is a concave function of $\lambda$ on
$[0, 1]$ (and thus, in particular, continuous on $(0, 1)$ and lower semicontinuous
on $[0, 1]$). It is either bounded above on $[0, 1]$ or infinite everywhere on $(0, 1)$.

Proof. Let $B$ be the set of all $a \in \mathcal{A}$ such that $L(P_\lambda, a) < \infty$ for some $\lambda \in (0, 1)$
(and thus, by Lemma 3.2, for all $\lambda \in [0, 1]$). If $B$ is empty, then $H(P_\lambda) = \infty$ for
all $\lambda \in (0, 1)$; in particular, $H(P_\lambda)$ is then concave on $[0, 1]$. Otherwise,
taking any fixed $a \in B$ we have $H(P_\lambda) \le L(P_\lambda, a) \le \max_i L(P_i, a)$, so $H(P_\lambda)$ is
bounded above on $[0, 1]$. Moreover, as the pointwise infimum of the nonempty
family of concave functions $\{L(P_\lambda, a) : a \in \mathcal{A}\}$, $H(P_\lambda)$ is itself a concave
function of $\lambda$ on $[0, 1]$. □

Corollary 3.2. If for all $a \in \mathcal{A}$, $L(P_\lambda, a) < \infty$ for some $\lambda \in (0, 1)$,
then for all $\lambda \in [0, 1]$, $H(P_\lambda) = \lim\{H(P_\mu) : \mu \in [0, 1], \mu \to \lambda\}$ [it being allowed
that $H(P_\lambda)$ is not finite].

Proof. In this case $B = \mathcal{A}$, so that $H(P_\lambda) = \inf_{a \in B} L(P_\lambda, a)$. Each
function $L(P_\lambda, a)$ is finite and linear, hence a closed concave function of $\lambda$ on
$[0, 1]$. This last property is then preserved on taking the infimum. The result
now follows from Theorem 7.5 of Rockafellar (1970). □

Corollary 3.3. If in addition $H(P_i)$ is finite for $i = 0, 1$, then $H(P_\lambda)$
is a bounded continuous function of $\lambda$ on $[0, 1]$.

Note that Corollary 3.3 will always apply when the loss function is bounded.

Under some further regularity conditions [see Dawid (1998, 2003) and
Section 3.5.4 below], a general concave function over $\mathcal{P}$ can be regarded as
generated from some decision problem by means of (12). Concave functions
have been previously proposed as general measures of the uncertainty or
diversity in a distribution [DeGroot (1962) and Rao (1982)], generalizing
the Shannon entropy. We shall thus call the Bayes loss $H$, as given by (12),
the (generalized) entropy function or uncertainty function associated with
the loss function $L$.
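The definition of generalized entropy as a Bayes loss, and its concavity along a mixture path, can be illustrated for a finite decision problem. The sketch below is only an illustration under assumptions: the two-outcome, three-act loss table `LOSS` (two "bets" and a "hedge") and the endpoint distributions are invented, and concavity is checked through second differences on an equally spaced grid rather than proved.

```python
# Hypothetical problem: two outcomes; acts "bet on 0", "bet on 1", "hedge".
LOSS = [
    [0.0, 1.0, 0.4],   # losses when x = 0
    [1.0, 0.0, 0.4],   # losses when x = 1
]

def expected_loss(p, a):
    """L(P, a) = E_P{L(X, a)}."""
    return sum(px * row[a] for px, row in zip(p, LOSS))

def generalized_entropy(p):
    """H(P) = inf_a L(P, a), the Bayes loss; here a minimum over three acts."""
    return min(expected_loss(p, a) for a in range(3))

def mixture(p0, p1, lam):
    """P_lambda = (1 - lambda) P_0 + lambda P_1."""
    return [(1 - lam) * u + lam * v for u, v in zip(p0, p1)]

p0, p1 = [0.9, 0.1], [0.2, 0.8]
grid = [i / 20 for i in range(21)]
h = [generalized_entropy(mixture(p0, p1, lam)) for lam in grid]

# Concavity on an equally spaced grid: second differences are nonpositive.
second_differences = [h[i - 1] - 2 * h[i] + h[i + 1] for i in range(1, 20)]
```

Because each $L(P_\lambda, a)$ is linear in $\lambda$, the resulting entropy is a minimum of three straight lines, which is why it is piecewise linear and concave in this example.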

3.4. Scoring rule. Suppose the action space $\mathcal{A}$ is itself a set $\mathcal{Q}$ of
distributions for $X$. Note we are not here considering $Q \in \mathcal{Q}$ as a randomized act
over $\mathcal{X}$, but rather as a simple act in its own right (e.g., a decision to quote
$Q$ as a description of uncertainty about $X$). We typically write the loss as
$S(x, Q)$ in this case and refer to $S$ as a scoring rule or score. Such scoring
rules are used to assess the performance of probability forecasters [Dawid
(1986)]. We say $S$ is $\Gamma$-proper if $\Gamma \subseteq \mathcal{Q} \subseteq \mathcal{P}$ and, for all $P \in \Gamma$, the choice
$Q = P$ is Bayes against $X \sim P$. Then for $P \in \Gamma$,

(14)    $H(P) = S(P, P).$

Suppose now we start from a general decision problem, with loss function
$L$ such that $\mathcal{Z}_Q$ is nonempty for all $Q \in \mathcal{Q}$. Then we can define a scoring
rule by

(15)    $S(x, Q) := L(x, \zeta_Q),$

where for each $Q \in \mathcal{Q}$ we suppose we have selected some specific Bayes
act $\zeta_Q \in \mathcal{Z}_Q$. Then for $P \in \mathcal{Q}$, $S(P, Q) = L(P, \zeta_Q)$ is clearly minimized when $Q = P$,
so that this scoring rule is $\mathcal{Q}$-proper. If $L$ is $\mathcal{Q}$-semistrict, then (15) does not
depend on the choice of Bayes act $\zeta_Q$. More generally, if $L$ is $\mathcal{Q}$-relatively
strict, then $S(P, Q)$ does not depend on such a choice, for all $P, Q \in \mathcal{Q}$.

We see that, for $P \in \mathcal{Q}$, $\inf_{Q \in \mathcal{Q}} S(P, Q) = S(P, P) = L(P, \zeta_P) = H(P)$.
In particular, the generalized entropy associated with the constructed
scoring rule (15) is identical with that determined by the original loss function
$L$. In this way, almost any decision problem can be reformulated in terms
of a proper scoring rule.
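The passage from a loss function to a proper scoring rule via (15) can be sketched for a finite problem. The example below assumes a hypothetical two-outcome, three-act loss table `LOSS`, selects a nonrandomized Bayes act by breaking ties toward the lowest index (one arbitrary selection rule among many), and checks properness only on a coarse grid of distributions.

```python
# Hypothetical finite decision problem: two outcomes, three acts.
LOSS = [
    [0.0, 1.0, 0.4],   # losses when x = 0
    [1.0, 0.0, 0.4],   # losses when x = 1
]

def expected_loss(p, a):
    """L(P, a) = E_P{L(X, a)}."""
    return sum(px * row[a] for px, row in zip(p, LOSS))

def bayes_act(q):
    """One selected Bayes act against Q (ties broken by lowest index)."""
    return min(range(3), key=lambda a: expected_loss(q, a))

def score(x, q):
    """S(x, Q) := L(x, a_Q), i.e. (15) with a nonrandomized Bayes act."""
    return LOSS[x][bayes_act(q)]

def expected_score(p, q):
    """S(P, Q) = E_P{S(X, Q)}."""
    return sum(px * score(x, q) for x, px in enumerate(p))

def entropy(p):
    """H(P) = inf_a L(P, a)."""
    return min(expected_loss(p, a) for a in range(3))

grid = [[i / 10, 1 - i / 10] for i in range(11)]
is_proper = all(expected_score(p, p) <= expected_score(p, q) + 1e-12
                for p in grid for q in grid)
matches_entropy = all(abs(expected_score(p, p) - entropy(p)) < 1e-12 for p in grid)
```

The score quoted for $Q$ depends on $Q$ only through the selected Bayes act, so many different $Q$ receive the same score here; the rule is proper but not strictly proper.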

Fig. 1. Brier, log and zero-one entropies for the case $\mathcal{X} = \{0, 1\}$.

3.5. Some examples. We now give some simple examples, both to illus- 
trate the above concepts and to provide a concrete focus for later develop- 
ment. Further examples may be found in Dawid (1998) and Dawid and Sebastiani 
(1999). 

3.5.1. Brier score. Although it can be generalized, we restrict our
treatment of the Brier score [Brier (1950)] to the case of a finite sample space
$\mathcal{X} = \{x_1, \ldots, x_N\}$. A distribution $P$ over $\mathcal{X}$ can be represented by its
probability vector $p = (p(1), \ldots, p(N))$, where $p(x) := P(X = x)$. A point $x \in \mathcal{X}$
may also be represented by the $N$-vector $\delta^x$ corresponding to the point-mass
distribution on $\{x\}$ having entries $\delta^x(j) = 1$ if $j = x$, $0$ otherwise. The Brier
score is then defined by

(16)    $S(x, Q) := \sum_{j=1}^{N} \{\delta^x(j) - q(j)\}^2$

(17)    $= \sum_j q(j)^2 - 2 q(x) + 1.$

Then

(18)    $S(P, Q) = \sum_j q(j)^2 - 2 \sum_j p(j) q(j) + 1,$

which is uniquely minimized for $Q = P$, so that this is a $\mathcal{P}$-strict proper
scoring rule. The corresponding entropy function is (see Figure 1)

(19)    $H(P) = 1 - \sum_j p(j)^2.$

3.5.2. Logarithmic score. An important scoring rule is the logarithmic
score, generalizing the discrete-case log loss as already considered in
Section 2. For a general sample space $\mathcal{X}$, let $\mu$ be a fixed $\sigma$-finite measure
(the base measure) on a suitable $\sigma$-algebra in $\mathcal{X}$, and take $\mathcal{A}$ to be the
set of all finite nonnegative measurable real functions $q$ on $\mathcal{X}$ for which
$\int q(x)\, d\mu(x) = 1$. Any $q \in \mathcal{A}$ can be regarded as the density of a distribution
$Q$ over $\mathcal{X}$ which is absolutely continuous with respect to $\mu$. We denote the
set of such distributions by $\mathcal{M}$. However, because densities are only defined
up to a set of measure 0, different $q$'s in $\mathcal{A}$ can correspond to the same
$Q \in \mathcal{M}$. Note moreover that the many-one correspondence between $q$ and
$Q$ depends on the specific choice of base measure $\mu$ and will change if we
change $\mu$.

We define a loss function by

(20)    $S(x, q) := -\log q(x).$

If (and only if) $P \in \mathcal{M}$, then $S(P, q)$ will be the same for all versions $q$ of
the density of the same distribution $Q \in \mathcal{M}$. Hence for $P, Q \in \mathcal{M}$ we can
write $S(P, Q)$ instead of $S(P, q)$, and we can consider $S$ to be a scoring
rule. It is well known that, for $P, Q, Q^* \in \mathcal{M}$,
$E_P\{S(X, Q) - S(X, Q^*)\} = -\int p(x) \log\{q(x)/q^*(x)\}\, d\mu$
is nonnegative for all $Q$ if and only if $Q^* = P$.
That is, $Q^*$ is Bayes against $P$ if and only if $Q^* = P$, so that this scoring
rule is $\mathcal{M}$-strictly proper.

We have, for $P \in \mathcal{M}$,

(21)    $H(P) = -\int p(x) \log p(x)\, d\mu(x),$

the usual definition of the entropy of $P$ with respect to $\mu$. When $\mathcal{X}$ is
discrete and $\mu$ is counting measure, we recover the Shannon entropy. For the
simple case $\mathcal{X} = \{0, 1\}$ this is depicted in Figure 1. Note that the whole
decision problem, and in particular the value of $H(P)$ as given by (21), will
be altered if we change (even in a mutually absolutely continuous way) the
base measure $\mu$.
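The Brier score of Section 3.5.1 and the logarithmic score with counting measure as base measure are easy to tabulate on $\mathcal{X} = \{0, 1\}$, the case shown in Figure 1. The sketch below is illustrative only: the function names and the test distributions `p` and `q` are assumptions, and the curves are evaluated on an arbitrary grid of 101 points.

```python
import math

def brier_score(x, q):
    """S(x, Q) = sum_j q(j)^2 - 2 q(x) + 1."""
    return sum(qj * qj for qj in q) - 2 * q[x] + 1

def expected_brier(p, q):
    """S(P, Q) = sum_j q(j)^2 - 2 sum_j p(j) q(j) + 1."""
    return sum(px * brier_score(x, q) for x, px in enumerate(p))

def brier_entropy(p):
    """H(P) = 1 - sum_j p(j)^2."""
    return 1 - sum(pj * pj for pj in p)

def log_score(x, q):
    """S(x, q) = -log q(x), with counting measure as base measure."""
    return -math.log(q[x]) if q[x] > 0 else math.inf

def log_entropy(p):
    """Shannon entropy, H(P) = -sum_x p(x) log p(x)."""
    return -sum(pj * math.log(pj) for pj in p if pj > 0)

# Entropy curves on X = {0, 1}, indexed by t = P(X = 1).
grid = [i / 100 for i in range(101)]
brier_curve = [brier_entropy([1 - t, t]) for t in grid]
log_curve = [log_entropy([1 - t, t]) for t in grid]

p, q = [0.7, 0.3], [0.4, 0.6]
# Excess expected Brier score of quoting Q instead of P: sum_j (p(j) - q(j))^2.
excess = expected_brier(p, q) - expected_brier(p, p)
```

Both curves vanish at the point masses and peak at the uniform distribution; the excess expected Brier score is the squared Euclidean distance between `p` and `q`, which is why the Brier score is strictly proper.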

Things simplify when $\mu$ is itself a probability measure. In this case $\mathcal{A}$
contains the constant function 1. For any distribution $P$ whatsoever, whether
or not $P \in \mathcal{M}$, we have $L(P, 1) = 0$, whence we deduce $H(P) \le 0$ (with
equality if and only if $P = \mu$). When $P \in \mathcal{M}$, (21) asserts $H(P) = -KL(P, \mu)$,
where $KL$ is the Kullback-Leibler divergence [Kullback (1959)]. [Note that it
is possible to have $KL(P, \mu) = \infty$, and thus $H(P) = -\infty$, even for $P \in \mathcal{M}$.]
If $P \notin \mathcal{M}$, there exist a measurable set $N$ and $a > 0$ such that $\mu(N) = 0$
but $P(N) = a$. Define $q_n(x) = 1$ ($x \notin N$), $q_n(x) = n$ ($x \in N$). Then $q_n \in \mathcal{A}$
and $L(P, q_n) = -a \log n$. It follows that $H(P) = -\infty$. Since the usual
definition [Csiszár (1975) and Posner (1975)] has $KL(P, \mu) = \infty$ when $P \not\ll \mu$,
we thus have $H(P) = -KL(P, \mu)$ in all cases. This formula exhibits clearly
the dependence of the entropy on the choice of $\mu$.
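On a finite sample space the relation $H(P) = -KL(P, \mu)$ and the divergent sequence $q_n$ can both be written out explicitly. The following sketch assumes a hypothetical base probability measure `mu` on three points, a hypothetical $P$ dominated by it, and a separate two-point example in which $\mu$ gives zero mass to a point that $P$ charges; all numbers are invented for illustration.

```python
import math

def kl(P, M):
    """KL(P, M) on a finite space; infinite if P is not dominated by M."""
    total = 0.0
    for Px, Mx in zip(P, M):
        if Px > 0:
            if Mx == 0:
                return math.inf
            total += Px * math.log(Px / Mx)
    return total

def expected_log_loss(P, q):
    """L(P, q) = E_P{-log q(X)} for a density q with respect to the base measure."""
    return -sum(Px * math.log(qx) for Px, qx in zip(P, q) if Px > 0)

def q_n(n):
    """Density equal to 1 off N = {1} and n on N; integrates to 1 under mu_s."""
    return [1.0, float(n)]

mu = [0.5, 0.25, 0.25]                         # hypothetical base probability measure
P = [0.2, 0.3, 0.5]                            # hypothetical P, dominated by mu
p_density = [Px / m for Px, m in zip(P, mu)]   # density of P with respect to mu
H_P = expected_log_loss(P, p_density)          # Bayes loss, attained at q = p

# Singular case: mu_s(N) = 0 but P_s(N) = a > 0, with N = {1}.
mu_s, P_s, a = [1.0, 0.0], [0.5, 0.5], 0.5
singular_losses = [expected_log_loss(P_s, q_n(n)) for n in (10, 100, 1000)]
```

The losses in `singular_losses` equal $-a \log n$ and decrease without bound as $n$ grows, matching the convention $KL(P, \mu) = \infty$ in the non-dominated case.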

3.5.3. Zero-one loss. Let $\mathcal{X}$ be finite or countable, take $\mathcal{A} = \mathcal{X}$ and
consider the loss function

(22) $L(x, a) = \begin{cases} 0, & \text{if } a = x, \\ 1, & \text{otherwise.} \end{cases}$

Then $L(P, a) = 1 - P(X = a)$, and a nonrandomized Bayes act under $P$ is
any mode of $P$. When $P$ has (at least) two modes, say $a_P$ and $a'_P$, then
$L(x, a_P)$ and $L(x, a'_P)$ are not identical, so that this loss function is not
$\mathcal{P}$-semistrict. This means that we may have to take account of randomized
strategies $\zeta$ for DM. Then, writing $\zeta(x) := \zeta(A = x)$, we have

(23) $L(x, \zeta) = 1 - \zeta(x)$

and

(24) $L(P, \zeta) = 1 - \sum_{x \in \mathcal{X}} p(x)\,\zeta(x)$.

A randomized act $\zeta$ is Bayes against $P$ if and only if it puts all its mass on
the set of modes of $P$.

We have generalized entropy function

(25) $H(P) = 1 - p_{\max}$,

with $p_{\max} := \sup_{x \in \mathcal{X}} p(x)$. For the simple case $\mathcal{X} = \{0, 1\}$, this is depicted
in Figure 1.

3.5.4. Bregman score. Suppose that $\#(\mathcal{X}) = N < \infty$ and that we represent
a distribution $P \in \mathcal{P}$ over $\mathcal{X}$ by its probability mass function $p \in \Delta$, the
unit simplex in $\mathbb{R}^N$, which can in turn be considered as a subset of
$(N - 1)$-dimensional Euclidean space. The interior $\Delta^\circ$ of $\Delta$ then corresponds to the
subset $\mathcal{Q} \subset \mathcal{P}$ of distributions giving positive probability to each point of $\mathcal{X}$.

Let $H$ be a finite concave real function on $\Delta$. For any $q \in \Delta^\circ$, the set
$\nabla H(q)$ of supporting hyperplanes to $H$ at $q$ is nonempty [Rockafellar (1970),
Theorem 27.3], having a unique member when $H$ is differentiable at $q$.
Select for each $q \in \Delta^\circ$ some specific member of $\nabla H(q)$, and let the height of
this hyperplane at arbitrary $p \in \Delta$ be denoted by $l_q(p)$: this affine function
must then have equation of the form

(26) $l_q(p) = H(q) + \alpha_q^{\top}(p - q)$.

Although the coefficient vector $\alpha_q \in \mathbb{R}^{\mathcal{X}}$ in (26) is only defined up to addition
of a multiple of the unit vector, this arbitrariness will be of no consequence.
We shall henceforth reuse the notation $\nabla H(q)$ in place of $\alpha_q$.
By the supporting hyperplane property,

(27) $l_q(p) \ge H(p)$,

(28) $l_q(q) = H(q)$.

Now consider the function $S : \mathcal{X} \times \mathcal{Q} \to \mathbb{R}$ defined by

(29) $S(x, Q) = H(q) + \nabla H(q)^{\top}(\delta^x - q)$,

where $\delta^x$ is the vector having $\delta^x(j) = 1$ if $j = x$, 0 otherwise.

Then we easily see that $S(P, Q) = l_q(p)$, so that, by (27) and (28), $S(P, Q)$
is minimized in $Q$ when $Q = P$. Thus $S$ is a $\mathcal{Q}$-proper scoring rule.

We note that

(30) $0 \le d(P, Q) := S(P, Q) - S(P, P) = H(q) + \nabla H(q)^{\top}(p - q) - H(p)$.

With further regularity conditions (including in particular differentiability),
(30) becomes the Bregman divergence [Bregman (1967), Csiszár (1991)
and Censor and Zenios (1997)] associated with the convex function $-H$. We
therefore call $S$, defined as in (29), a Bregman score associated with $H$. This
will be unique when $H$ is differentiable on $\Delta^\circ$. In Section 8 we introduce a
more general decision-theoretic notion of divergence.

We note by (28) that the generalized entropy function associated with this
score is $H^*(P) = S(P, P) = l_p(p) = H(p)$ (at any rate inside $\Delta^\circ$). That is to
say, we have exhibited a decision problem for which a prespecified concave
function $H$ is the entropy. This construction can be extended to the whole of
$\Delta$ and to certain concave functions $H$ that are not necessarily finite [Dawid
(2003)]. Extensions can also be made to more general sample spaces.
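
The construction (26)-(30) can be checked numerically. The sketch below takes
the concave function $H(q) = 1 - \sum_j q(j)^2$ as a worked choice, builds the
score (29) from its gradient, and compares it with the Brier score, here
assumed to have its usual form $\sum_j \{\delta^x(j) - q(j)\}^2$. All function
names and the two test distributions are illustrative assumptions.

```python
def H(q):
    """A concave function on the simplex: H(q) = 1 - sum_j q(j)^2."""
    return 1.0 - sum(v * v for v in q)

def grad_H(q):
    """Gradient of H, playing the role of the coefficient vector in (26)."""
    return [-2.0 * v for v in q]

def bregman_score(x, q):
    """S(x, Q) = H(q) + grad_H(q)^T (delta^x - q), as in (29)."""
    g = grad_H(q)
    return H(q) + g[x] - sum(gj * qj for gj, qj in zip(g, q))

def expected_score(p, q):
    """S(P, Q) = sum_x p(x) S(x, Q)."""
    return sum(px * bregman_score(x, q) for x, px in enumerate(p))

def brier_score(x, q):
    """Brier score in its usual form: sum_j (delta^x(j) - q(j))^2."""
    return sum(((1.0 if j == x else 0.0) - qj) ** 2 for j, qj in enumerate(q))

p = [0.2, 0.3, 0.5]
q = [0.6, 0.3, 0.1]
divergence = expected_score(p, q) - expected_score(p, p)  # d(P, Q) in (30)
```

For this choice of $H$ the divergence (30) reduces to the squared Euclidean
distance $\sum_j \{p(j) - q(j)\}^2$, and $S(P, P)$ returns $H(p)$, in line with (28).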
3.5.5. Separable Bregman score. A special case of the construction of
Section 3.5.4 arises when we take $H(q)$ to have the form $-\sum_{x \in \mathcal{X}} \psi\{q(x)\}$,
with $\psi$ a real-valued differentiable convex function of a nonnegative argument.
In this case we can take $(\nabla H(q))(x) = -\psi'\{q(x)\}$, and the associated
proper scoring rule has

(31) $S(x, Q) = -\psi'\{q(x)\} - \sum_{t \in \mathcal{X}} [\psi\{q(t)\} - q(t)\,\psi'\{q(t)\}]$.

We term this the separable Bregman scoring rule associated with $\psi$. The
corresponding separable Bregman divergence [confusingly, this special case
of (30) is sometimes also referred to simply as a Bregman divergence] is

(32) $d_\psi(P, Q) = \sum_{x \in \mathcal{X}} \Delta_\psi\{p(x), q(x)\}$,

where we have introduced

(33) $\Delta_\psi(a, b) := \psi(a) - \psi(b) - \psi'(b)(a - b)$.

The nonnegative function $\Delta_\psi$ measures how much the convex function $\psi$
deviates at $a$ from its tangent at $b$; this can be considered as a measure of
"how convex" $\psi$ is.

We can easily extend the above definition to more general sample spaces.
Thus let $\mathcal{X}$, $\mu$, $\mathcal{A}$ and $\mathcal{M}$ be as in Section 3.5.2, and, in analogy with (31),
consider the following loss function:

(34) $S(x, q) := -\psi'\{q(x)\} - \int [\psi\{q(t)\} - q(t)\,\psi'\{q(t)\}]\,d\mu(t)$.

Clearly if $q$, $q'$ are both $\mu$-densities of the same $Q \in \mathcal{M}$, then $S(x, q) =
S(x, q')$ a.e. $[\mu]$, and so, for any $P \in \mathcal{M}$, $S(P, q) = S(P, q')$. Thus once again,
for $P, Q \in \mathcal{M}$, we can simply write $S(P, Q)$. We then have

(35) $S(P, Q) = \int [\{q(t) - p(t)\}\,\psi'\{q(t)\} - \psi\{q(t)\}]\,d\mu(t)$,

whence

(36) $S(P, P) = -\int \psi\{p(t)\}\,d\mu(t)$,

and so, if $S(P, P)$ is finite,

(37) $d_\psi(P, Q) := S(P, Q) - S(P, P) = \int \Delta_\psi\{p(t), q(t)\}\,d\mu(t)$.

Thus, for $P, Q \in \mathcal{M}$, if $S(P, P)$ is finite, $S(P, P) \le S(P, Q)$. Using the
extended definition (11) of Bayes acts, we can show that $P$ is Bayes against $P$
even when $S(P, P)$ is infinite. That is, $S$ is an $\mathcal{M}$-proper scoring rule. If $\psi$
is strictly convex, $S$ is $\mathcal{M}$-strict.

The quantity $d_\psi(P, Q)$ defined by (37) is identical with the (separable)
Bregman divergence [Bregman (1967) and Csiszár (1991)] $B_\psi(p, q)$, based
on $\psi$ (and $\mu$), between the densities $p$ and $q$ of $P$ and $Q$. Consequently, we
shall term $S(x, q)$ given by (34) a separable Bregman score. For $P \in \mathcal{M}$ the
associated separable Bregman entropy is then, by (36),

(38) $H_\psi(P) = -\int \psi\{p(t)\}\,d\mu(t)$.

The logarithmic score arises as a special case of the separable Bregman
score on taking $\psi(s) = s \log s$; and the Brier score arises on taking $\mu$ to be
counting measure and $\psi(s) = s^2 - 1/N$.
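
The first of these special cases can be verified directly on a finite sample
space with counting measure. The following sketch, with illustrative function
names and test distributions of our own choosing, evaluates (31)-(33) for
$\psi(s) = s \log s$ and compares the results with the logarithmic score and the
Kullback-Leibler divergence.

```python
import math

def psi(s):
    """psi(s) = s log s, with psi(0) = 0."""
    return s * math.log(s) if s > 0 else 0.0

def dpsi(s):
    """Derivative psi'(s) = log s + 1 (requires s > 0)."""
    return math.log(s) + 1.0

def separable_score(x, q):
    """S(x, Q) from (31), for a probability mass function q with positive entries."""
    return -dpsi(q[x]) - sum(psi(qt) - qt * dpsi(qt) for qt in q)

def delta_psi(a, b):
    """Delta_psi(a, b) from (33)."""
    return psi(a) - psi(b) - dpsi(b) * (a - b)

p = [0.2, 0.3, 0.5]
q = [0.6, 0.3, 0.1]
d_psi = sum(delta_psi(a, b) for a, b in zip(p, q))   # divergence (32)
kl = sum(a * math.log(a / b) for a, b in zip(p, q))  # Kullback-Leibler divergence
```

Because $q$ sums to one, the sum in (31) collapses to $-1$ and the score reduces
to $-\log q(x)$; likewise the extra terms $-a + b$ in $\Delta_\psi$ cancel on summation,
leaving the Kullback-Leibler divergence.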

3.5.6. More examples. Since every decision problem generates a generalized
entropy function, an enormous range of such functions can be constructed.
As a very simple case, consider the quadratic loss problem, with
$\mathcal{X} = \mathcal{A} = \mathbb{R}$, $L(x, a) = (x - a)^2$. Then $a_P = E_P(X)$ is Bayes against $P$, and
the associated proper scoring rule and entropy are $S(x, P) = \{x - E_P(X)\}^2$
and $H(P) = \mathrm{var}_P(X)$, a very natural measure of uncertainty. This cannot
be expressed in the form (38), so it is not associated with a separable Bregman
divergence. Dawid and Sebastiani (1999) characterize all those generalized
entropy functions that depend only on the variance of a (possibly
multivariate) distribution.
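
A quick numerical check of the quadratic loss example, under the assumption of
a hypothetical three-point distribution and a crude grid search over acts
(both chosen purely for illustration):

```python
def expected_loss(xs, ps, a):
    """L(P, a) = E_P (X - a)^2 for a discrete P on points xs with weights ps."""
    return sum(p * (x - a) ** 2 for x, p in zip(xs, ps))

xs = [0.0, 1.0, 4.0]
ps = [0.5, 0.3, 0.2]
mean = sum(p * x for x, p in zip(xs, ps))
variance = sum(p * x * x for x, p in zip(xs, ps)) - mean ** 2

# Crude search for the Bayes act over a grid of candidate acts in [0, 4].
grid = [i / 1000 for i in range(0, 4001)]
best_act = min(grid, key=lambda a: expected_loss(xs, ps, a))
```

The grid minimizer agrees with the mean to within the grid spacing, and the
minimized expected loss is the variance of the distribution.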

4. Maximum entropy and robust Bayes. Suppose that Nature may be
regarded as generating $X$ from a distribution $P$, but DM does not know $P$.
All that is known is that $P \in \Gamma$, a specified family of distributions over $\mathcal{X}$.
The consequence DM faces if he or she takes act $a \in \mathcal{A}$ when Nature chooses
$X = x$ is measured by the loss $L(x, a)$. How should DM act?

4.1. Maximum entropy. One way of proceeding is to replace the family
$\Gamma$ by some "representative" member $P^* \in \Gamma$, and then choose an act that
is Bayes against $P^*$. A possible criterion for choosing $P^*$, generalizing the
standard maximum Shannon entropy procedure, might be:

Maximize, over $P \in \Gamma$, the generalized entropy $H(P)$.

4.2. Robust Bayes rules. Another approach is to conduct a form of "robust
Bayes analysis" [Berger (1985)]. In particular we investigate the
$\Gamma$-minimax criterion, a compromise between Bayesian and frequentist decision
theory. For a recent tutorial overview of this criterion, see Vidakovic (2000).

When $X \sim P \in \Gamma$, the loss of an act $a$ is evaluated by $L(P, a)$. We can
form a new restricted game, $\mathcal{G}^\Gamma = (\Gamma, \mathcal{A}, L)$, where Nature selects a distribution
$P$ from $\Gamma$, DM an act $a$ from $\mathcal{A}$, and the ensuing loss to DM is taken to
be $L(P, a)$. Again, we allow DM to take randomized acts $\zeta \in \mathcal{Z}$, yielding loss
$L(P, \zeta)$ when Nature generates $X$ from $P$. In principle we could also let Nature
choose her distribution $P$ in some random fashion, described by means
of a law (distribution) for a random distribution $P$ over $\mathcal{X}$. However, with
the exception of Section 10, where randomization is in any case excluded, in
all the cases we shall consider $\Gamma$ will be convex, and then every randomized
act for Nature can be replaced by a nonrandomized act (the mean of the law
of $P$) having the identical loss function. Consequently we shall not consider
randomized acts for Nature.

In the absence of knowledge of Nature's choice of P, we might apply the 
minimax criterion to this restricted game. This leads to the prescription 
for DM: 

Choose $\zeta = \zeta^* \in \mathcal{Z}$, to achieve

(39) $\inf_{\zeta \in \mathcal{Z}} \sup_{P \in \Gamma} L(P, \zeta)$.

We shall term any act $\zeta^*$ achieving (39) robust Bayes against $\Gamma$, or
$\Gamma$-minimax.

When the basic game is defined in terms of a $\mathcal{Q}$-proper scoring rule
$S(x, Q)$, and $\Gamma \subseteq \mathcal{Q}$, this robust Bayes criterion becomes:

Choose $Q = Q^*$, to achieve

(40) $\inf_{Q \in \mathcal{Q}} \sup_{P \in \Gamma} S(P, Q)$.

Note particularly that in this case there is no reason to require $\mathcal{Q} = \Gamma$; we
might want to take $\mathcal{Q}$ larger than $\Gamma$ (typically, $\mathcal{Q} = \mathcal{P}$). Also, we have not
considered randomized acts in (40); we shall see later that, for the problems
we consider, this has no effect.

Below we explore the relationship between the above two methods. In 
particular, we shall show that, in very general circumstances, they produce 
identical results. That is, maximum generalized entropy is robust Bayes. 
This will be the cornerstone of all our results to come. 

First note that from (12) the maximum entropy criterion can be ex- 
pressed as: 

Choose P = P* , to achieve 

(41) $\sup_{P \in \Gamma} \inf_{\zeta \in \mathcal{Z}} L(P, \zeta)$.

There is a striking duality with the criterion (39). 

In the general terminology of game theory, (41) defines the extended real
lower value,

(42) $\underline{V} := \sup_{P \in \Gamma} \inf_{\zeta \in \mathcal{Z}} L(P, \zeta)$,

and (39) the upper value,

(43) $\overline{V} := \inf_{\zeta \in \mathcal{Z}} \sup_{P \in \Gamma} L(P, \zeta)$,

of the restricted game $\mathcal{G}^\Gamma$. In particular, the maximum achievable entropy is
exactly the lower value. We always have $\underline{V} \le \overline{V}$. When these two are equal
and finite, we say the game $\mathcal{G}^\Gamma$ has a value, $V := \underline{V} = \overline{V}$.

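Definition 4.1. The pair $(P^*, \zeta^*) \in \Gamma \times \mathcal{Z}$ is a saddle-point (or equilibrium)
in the game $\mathcal{G}^\Gamma$ if $H^* := L(P^*, \zeta^*)$ is finite, and the following hold:

(44) (a) $L(P^*, \zeta^*) \le L(P^*, \zeta)$ for all $\zeta \in \mathcal{Z}$;
     (b) $L(P^*, \zeta^*) \ge L(P, \zeta^*)$ for all $P \in \Gamma$.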

In Sections 5 and 6 we show for convex $\Gamma$ the existence of a saddle-point
in $\mathcal{G}^\Gamma$ under a variety of broadly applicable conditions.

In certain important special cases [see, e.g., Section 2.1, (3)], we may be
able to demonstrate (b) above by showing that $\zeta^*$ is an equalizer rule:

Definition 4.2. $\zeta \in \mathcal{Z}$ is an equalizer rule in $\mathcal{G}^\Gamma$ if $L(P, \zeta)$ is the same
finite constant for all $P \in \Gamma$.

Lemma 4.1. Suppose that there exist both a maximum entropy distribution
$P^* \in \Gamma$ achieving (42), and a robust Bayes act $\zeta^* \in \mathcal{Z}$ achieving (43).
Then $\underline{V} \le L(P^*, \zeta^*) \le \overline{V}$. If, further, the game has a value, $V$ say, then
$V = H^* := L(P^*, \zeta^*)$, and $(P^*, \zeta^*)$ is a saddle-point in the game $\mathcal{G}^\Gamma$.

Proof. $\underline{V} = \inf_{\zeta} L(P^*, \zeta) \le L(P^*, \zeta^*)$, and similarly $L(P^*, \zeta^*) \le \overline{V}$. If
the game has a value $V$, then $L(P^*, \zeta^*) = V = \inf_{\zeta \in \mathcal{Z}} L(P^*, \zeta)$, and
$L(P^*, \zeta^*) = V = \sup_{P \in \Gamma} L(P, \zeta^*)$. □

Note that, even when the game has a value, either or both of $P^*$ and $\zeta^*$
may fail to exist.

Conversely, we have the following theorem. 

Theorem 4.1. Suppose that a saddle-point $(P^*, \zeta^*)$ exists in the game
$\mathcal{G}^\Gamma$. Then:

(i) The game has value $H^* = L(P^*, \zeta^*)$.

(ii) $\zeta^*$ is Bayes against $P^*$.

(iii) $H(P^*) = H^*$.

(iv) $P^*$ maximizes the entropy $H(P)$ over $\Gamma$.

(v) $\zeta^*$ is robust Bayes against $\Gamma$.

Proof. Part (i) follows directly from (44) and the definitions of $\underline{V}$, $\overline{V}$.
Part (ii) is immediate from (44)(a) and finiteness, and in turn implies (iii).
For any $P \in \Gamma$, $H(P) \le L(P, \zeta^*) \le H^*$ by (44)(b). Then (iv) follows from
(iii). For any $\zeta \in \mathcal{Z}$, $\sup_P L(P, \zeta) \ge L(P^*, \zeta)$, so that, by (44)(a),

(45) $\sup_P L(P, \zeta) \ge H^*$.

Also, by (44)(b),

(46) $\sup_P L(P, \zeta^*) = H^*$.

Comparing (45) and (46), we see that $\zeta^*$ achieves (39); that is, (v) holds. □

Corollary 4.1. Suppose that $L$ is $\Gamma$-relatively strict, that there is a
unique $P^* \in \Gamma$ maximizing the generalized entropy $H$ and that $\zeta^* \in \mathcal{Z}$ is a
Bayes act against $P^*$. Then, if $\mathcal{G}^\Gamma$ has a saddle-point, $\zeta^*$ is robust Bayes
against $\Gamma$.

Corollary 4.2. Let the basic game $\mathcal{G}$ be defined in terms of a $\mathcal{Q}$-strictly
proper scoring rule $S(x, Q)$, and let $\Gamma \subseteq \mathcal{Q}$. If a saddle-point in the restricted
game $\mathcal{G}^\Gamma$ exists, it will have the form $(P^*, P^*)$. The distribution $P^*$ will then
solve each of the following problems:

(i) Maximize over $P \in \Gamma$ the generalized entropy $H(P) = S(P, P)$.

(ii) Minimize over $Q \in \mathcal{Q}$ the worst-case expected score, $\sup_{P \in \Gamma} S(P, Q)$.

It is notable that, when Corollary 4.2 applies, the robust Bayes distribution
solving problem (ii) turns out to belong to $\Gamma$, even though this constraint
was not imposed.

We see from Theorem 4.1 that, when a saddle-point exists, the robust 
Bayes problem reduces to a maximum entropy problem. This property can 
thus be regarded as an indirect justification for applying the maximum en- 
tropy procedure. In the light of Theorem 4.1, we shall be particularly in- 
terested in the sequel in characterizing those decision problems for which a 
saddle-point exists in the game $\mathcal{G}^\Gamma$.

4.3. A special case. A partial characterization of a saddle-point can be
given in the special case that the family $\Gamma$ is closed under conditioning, in the
sense that, for all $P \in \Gamma$ and $B \subseteq \mathcal{X}$ a measurable set such that $P(B) > 0$,
$P_B$, the conditional distribution under $P$ for $X$ given $X \in B$, is also in
$\Gamma$. This will hold, most importantly, when $\Gamma$ is the set of all distributions
supported on $\mathcal{X}$ or on some measurable subset of $\mathcal{X}$.

For the following lemma, we suppose that there exists a saddle-point
$(P^*, \zeta^*)$ in the game $\mathcal{G}^\Gamma$, and write $H^* = L(P^*, \zeta^*)$. In particular, we have
$L(P, \zeta^*) \le H^*$ for all $P \in \Gamma$. We introduce $U := \{x \in \mathcal{X} : L(x, \zeta^*) = H^*\}$.

Lemma 4.2. Suppose that $\Gamma$ is closed under conditioning and that $P \in \Gamma$
is such that $L(P, \zeta^*) = H^*$. Then $P$ is supported on $U$.

Proof. Take $h < H^*$, and define $B := \{x \in \mathcal{X} : L(x, \zeta^*) \le h\}$, $\pi := P(B)$.
By linearity, we have $H^* = L(P, \zeta^*) = \pi L(P_B, \zeta^*) + (1 - \pi) L(P_{B^c}, \zeta^*)$ (where $B^c$
denotes the complement of $B$). However, by the definition of $B$, $L(P_B, \zeta^*) \le h$,
while (if $\pi \ne 1$) $L(P_{B^c}, \zeta^*) \le H^*$, by Definition 4.1(b) and the fact that
$P_{B^c} \in \Gamma$. It readily follows that $\pi = 0$. Since this holds for any $h < H^*$, we
must have $P\{L(X, \zeta^*) \ge H^*\} = 1$. However, $E_P\{L(X, \zeta^*)\} = L(P, \zeta^*) = H^*$,
and the result follows. □

Corollary 4.3. $L(X, \zeta^*) = H^*$ almost surely under $P^*$.

Corollary 4.4. If there exists $P \in \Gamma$ that is not supported on $U$, then
$\zeta^*$ is not an equalizer rule in $\mathcal{G}^\Gamma$.

Corollary 4.4 will apply, in particular, when $\Gamma$ is the family of all distributions
supported on a subset $A$ of $\mathcal{X}$ and (as will generally be the case) $A$
is not a subset of $U$. Furthermore, since $\Gamma$ then contains the point mass at
$x \in A$, we must have $L(x, \zeta^*) \le H^*$, all $x \in A$, so that $U$ is the subset of $A$
on which the function $L(\cdot, \zeta^*)$ attains its maximum. In a typical such problem
having a continuous sample space, the maxima of this function will be
isolated points, and then we deduce that the maximum entropy distribution
$P^*$ will be discrete (and the robust Bayes act $\zeta^*$ will not be an equalizer
rule).

5. An elementary minimax theorem. Throughout this section we suppose
that $\mathcal{X} = \{x_1, \ldots, x_N\}$ is finite and that $L$ is bounded. In particular,
$L(P, a)$ and $H(P)$ are finite for all distributions $P$ over $\mathcal{X}$, and the set $\mathcal{P}$ of
these distributions can be identified with the unit simplex in $\mathbb{R}^N$. We endow
$\mathcal{P}$ with the topology inherited from this identification.

In this case we can show the existence of a saddle-point under some sim- 
ple conditions. The following result is a variant of von Neumann's original 
minimax theorem [von Neumann (1928)]. It follows immediately from the
general minimax theorem of Corollary A.1, whose conditions are here
readily verified.

Theorem 5.1. Let $\Gamma$ be a closed convex subset of $\mathcal{P}$. Then the restricted
game $\mathcal{G}^\Gamma$ has a finite value $H^*$, and the entropy $H(P)$ achieves its maximum
$H^*$ over $\Gamma$ at some distribution $P^* \in \Gamma$.

Theorem 5.1 does not automatically ensure the existence of a robust Bayes
act. For this we impose a further condition on the action space. This involves
the risk-set $\mathcal{S}$ of the unrestricted game $\mathcal{G}$, that is, the convex subset of
$\mathbb{R}^N$ consisting of all points $l(\zeta) := (L(x_1, \zeta), \ldots, L(x_N, \zeta))$ arising as the risk
function of some possibly randomized act $\zeta \in \mathcal{Z}$.

Theorem 5.2. Suppose that $\Gamma$ is convex, and that the unrestricted risk-set
$\mathcal{S}$ is closed. Then there exists a robust Bayes act $\zeta^* \in \mathcal{Z}$. Moreover, there
exists $P^*$ in the closure $\overline{\Gamma}$ of $\Gamma$ such that $\zeta^*$ is Bayes against $P^*$ and $(P^*, \zeta^*)$
is a saddle-point in the game $\mathcal{G}^{\overline{\Gamma}}$.

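Proof. First assume $\Gamma$ closed. By Theorem 5.1 the game $\mathcal{G}^\Gamma$ has a finite
value $H^*$. Then there exists a sequence $(\zeta_n)$ in $\mathcal{Z}$ such that
$\lim_{n \to \infty} \sup_{P \in \Gamma} L(P, \zeta_n) = \inf_{\zeta \in \mathcal{Z}} \sup_{P \in \Gamma} L(P, \zeta) = H^*$.
Since $\mathcal{S}$ is compact, on taking a subsequence
if necessary we can find $\zeta^* \in \mathcal{Z}$ such that $l(\zeta_n) \to l(\zeta^*)$. Then, for all
$Q \in \Gamma$,

(47) $L(Q, \zeta^*) = \lim_{n \to \infty} L(Q, \zeta_n) \le \lim_{n \to \infty} \sup_{P \in \Gamma} L(P, \zeta_n) = H^*$,

whence

(48) $\sup_{P \in \Gamma} L(P, \zeta^*) \le H^*$.

However, for $P = P^*$, as given by Theorem 5.1, we have $L(P^*, \zeta^*) \ge H(P^*) =
H^*$, so that $L(P^*, \zeta^*) = H^*$. The result now follows.

If $\Gamma$ is not closed, we can apply the above argument with $\Gamma$ replaced
by $\overline{\Gamma}$ to obtain $\zeta^*$ and $P^* \in \overline{\Gamma}$. Then
$\sup_{P \in \overline{\Gamma}} L(P, \zeta^*) \le \sup_{P \in \overline{\Gamma}} L(P, \zeta)$, all
$\zeta \in \mathcal{Z}$. Since $L(P, \zeta)$ is linear, hence continuous, in $P$ for all $\zeta$,
$\sup_{P \in \Gamma} L(P, \zeta) = \sup_{P \in \overline{\Gamma}} L(P, \zeta)$, and the general result follows. □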

Note that $\mathcal{S}$ is the convex hull of $\mathcal{S}_0$, the set of risk functions of
nonrandomized acts. A sufficient condition for $\mathcal{S}$ to be closed is that $\mathcal{S}_0$ be closed.
In particular this will always hold if $\mathcal{A}$ is finite.

The above theorem gives a way of restricting the search for a robust Bayes
act $\zeta^*$: first find a distribution $P^*$ maximizing the entropy over $\Gamma$, then look
for acts that are Bayes against $P^*$. In some cases this will yield a unique
solution, and we are done. However, as will be seen below, this need not
always be the case, and then further principles may be required.

5.1. Examples. 

5.1.1. Brier score. Consider the Brier score (16) for $\mathcal{X} = \{0, 1\}$ and
$\Gamma = \mathcal{P}$. Let $H$ be the corresponding entropy as in (19). From Figure 1, or directly,
we see that the entropy is maximized for $P^*$ having $p^*(0) = p^*(1) = 1/2$.
Since the Brier score is $\mathcal{P}$-strictly proper, the unique Bayes act against $P^*$ is
$P^*$ itself. It follows that $P^*$ is the robust Bayes act against $\Gamma$. Hence in this
case we can find the robust Bayes act simply by maximizing the entropy.
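
The Brier example can be reproduced by a crude grid search. The sketch below
assumes the usual form of the Brier score, $\sum_j \{\delta^x(j) - q(j)\}^2$, and the
entropy $H(P) = 1 - \sum_j p(j)^2$; the grid resolution and function names are
illustrative choices.

```python
def brier(x, q1):
    """Brier score on X = {0, 1} when Q assigns probability q1 to x = 1."""
    q = (1.0 - q1, q1)
    return sum(((1.0 if j == x else 0.0) - q[j]) ** 2 for j in (0, 1))

def worst_case(q1):
    """sup over all P of S(P, Q); S(P, Q) is linear in P, so a point mass attains it."""
    return max(brier(0, q1), brier(1, q1))

def entropy(p1):
    """H(P) = 1 - sum_j p(j)^2 when P assigns probability p1 to x = 1."""
    return 1.0 - (1.0 - p1) ** 2 - p1 ** 2

grid = [i / 100 for i in range(101)]
p_star = max(grid, key=entropy)      # maximum entropy distribution
q_star = min(grid, key=worst_case)   # robust Bayes act
```

On this grid both searches return the uniform distribution, and the maximal
entropy coincides with the minimal worst-case expected score, as Theorem 4.1
would lead one to expect when a saddle-point exists.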
5.1.2. Zero-one loss. Now consider the zero-one loss (22) for $\mathcal{X} = \{0, 1\}$
and $\Gamma = \mathcal{P}$. Let $H$ be the corresponding entropy as in (25). From Figure 1,
or directly, we see that the entropy is again maximized for $P^*$ with
$p^*(0) = p^*(1) = 1/2$. However, in contrast to the case of the Brier score, $P^*$
now has several Bayes acts. In fact, every distribution $\zeta$ over $\mathcal{A} = \{0, 1\}$ is
Bayes against $P^*$, yet only one of them (namely, $\zeta^* = P^*$) is robust Bayes.
Therefore finding the maximum entropy $P^*$ is of no help whatsoever in finding
the robust Bayes act $\zeta^*$ here. As we shall see in Section 7.6.3, however,
this does not mean that the procedure described here (find a robust Bayes
act by first finding the maximum entropy $P^*$ and then determine the Bayes
acts of $P^*$) is never useful for zero-one loss: if $\Gamma \ne \mathcal{P}$, it may help in finding
$\zeta^*$ after all.
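
The contrast with the Brier score can be seen numerically. In the following
sketch (grid resolution and names are again illustrative assumptions), every
randomized act has the same expected loss against the uniform $P^*$, while the
worst-case loss singles out one act.

```python
def expected_loss(p1, z1):
    """L(P, zeta) = 1 - sum_x p(x) zeta(x) on X = A = {0, 1}, as in (24).

    p1 and z1 are the probabilities that P and zeta assign to the point 1.
    """
    return 1.0 - ((1.0 - p1) * (1.0 - z1) + p1 * z1)

def worst_case(z1):
    """sup over all P of L(P, zeta); by linearity in P a point mass attains it."""
    return max(expected_loss(0.0, z1), expected_loss(1.0, z1))

grid = [i / 100 for i in range(101)]
bayes_losses = [expected_loss(0.5, z1) for z1 in grid]  # against P* = (1/2, 1/2)
z_star = min(grid, key=worst_case)                      # robust Bayes act
```

All entries of `bayes_losses` equal 1/2, so the maximum entropy distribution
does not discriminate between acts, whereas the worst-case criterion picks out
the uniform randomization.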

6. More general minimax theorems. We are now ready to formulate 
more general minimax theorems. The proofs are given in the Appendix. 

Let $(\mathcal{X}, \mathcal{B})$ be a metric space together with its Borel $\sigma$-algebra. Recall
[Billingsley (1999), Section 5] that a family $\Gamma$ of distributions on $(\mathcal{X}, \mathcal{B})$ is
called (uniformly) tight if, for all $\varepsilon > 0$, there exists a compact set $C \in \mathcal{B}$
such that $P(C) > 1 - \varepsilon$ for all $P \in \Gamma$.

Theorem 6.1. Let $\Gamma \subseteq \mathcal{P}$ be a convex, weakly closed and tight set of
distributions. Suppose that for each $a \in \mathcal{A}$ the loss function $L(x, a)$ is bounded
above and upper semicontinuous in $x$. Then the restricted game $\mathcal{G}^\Gamma = (\Gamma, \mathcal{A}, L)$
has a value. Moreover, a maximum entropy distribution $P^*$, attaining

$\sup_{P \in \Gamma} \inf_{a \in \mathcal{A}} L(P, a)$,

exists.

We note that if $\mathcal{X}$ is finite or countable and endowed with the discrete
topology, then $L(x, a)$ is automatically a continuous, hence upper
semicontinuous, function of $x$.

Theorem 6.1 cannot be applied to the logarithmic score, which is not
bounded above in general. In such cases we may be able to use the theorems
below. Note that these all refer to possibly randomized Bayes acts $\zeta^*$,
but by Proposition 3.1 it will always be possible to choose such acts to be
nonrandomized.

Theorem 6.2. Let $\Gamma \subseteq \mathcal{P}$ be convex, and let $P^* \in \Gamma$, with Bayes act
$\zeta^*$, be such that $-\infty < H(P^*) = H^* := \sup_{P \in \Gamma} H(P) < \infty$. Suppose that for
all $P \in \Gamma$ there exists $P_0 \in \mathcal{P}$ such that, on defining $Q_\lambda := (1 - \lambda) P_0 + \lambda P$,
the following hold:

(i) $P^* = Q_{\lambda^*}$ for some $\lambda^* \in (0, 1)$.

(ii) The function $H(Q_\lambda)$ is differentiable at $\lambda = \lambda^*$.

Then $(P^*, \zeta^*)$ is a saddle-point in $\mathcal{G}^\Gamma$.

Theorem 6.2 essentially gives differentiability of the entropy as a condition 
for the existence of a saddle-point. This condition is strong but often easy to 
check. We now introduce a typically weaker condition, which may, however, 
be harder to check. 

Condition 6.1. Let $(Q_n)$ be a sequence of distributions in $\Gamma$, with
respective Bayes acts $(\zeta_n)$, such that the sequence $(H(Q_n))$ is bounded below
and $(Q_n)$ converges weakly to some distribution $Q_0 \in \mathcal{P}_0$. Then we require
that $Q_0 \in \mathcal{P}$, $Q_0$ has a Bayes act $\zeta_0$ and, for some choice of the Bayes acts
$(\zeta_n)$ and $\zeta_0$, $L(P, \zeta_0) \le \liminf_{n \to \infty} L(P, \zeta_n)$ for all $P \in \Gamma$.

One would typically aim to demonstrate Condition 6.1 in its stronger
"$\Gamma$-free" form, wherein all mentions of $\Gamma$ are replaced by $\mathcal{P}$, or both $\Gamma$ and $\mathcal{P}$
are replaced by some family $\mathcal{Q}$ with $\Gamma \subseteq \mathcal{Q} \subseteq \mathcal{P}$. In particular, in the case of
a $\mathcal{Q}$-proper scoring rule $S$, Condition 6.1 is implied by the following.

Condition 6.2. Let $(Q_n)$ be a sequence of distributions in $\mathcal{Q}$ such that
the sequence $(H(Q_n))$ is bounded below and $(Q_n)$ converges weakly to $Q_0$.
Then we require $Q_0 \in \mathcal{Q}$ and $S(P, Q_0) \le \liminf_{n \to \infty} S(P, Q_n)$ for all $P \in \mathcal{Q}$.

This displays the condition as one of weak lower semicontinuity of the 
score in its second argument. 

We shall further consider the following possible conditions on $\Gamma$:

Condition 6.3. $\Gamma$ is convex; every $P \in \Gamma$ has a Bayes act $\zeta_P$ and finite
entropy $H(P)$; and $H^* := \sup_{P \in \Gamma} H(P) < \infty$.

Condition 6.4. Furthermore, there exists $P^* \in \Gamma$ with $H(P^*) = H^*$.

Theorem 6.3. Suppose Conditions 6.1, 6.3 and 6.4 hold. Then there
exists $\zeta^* \in \mathcal{Z}$ such that $(P^*, \zeta^*)$ is a saddle-point in the game $\mathcal{G}^\Gamma$.

If $H(P)$ is not upper-semicontinuous or if $\Gamma$ is not closed in the weak
topology, then $\sup_{P \in \Gamma} H(P)$ may not be achieved. As explained in the
Appendix, for a general sample space these are both strong requirements. If
they do not hold, then Theorem 6.3 will not be applicable. In that case we
may instead be able to apply Theorem 6.4:

Theorem 6.4. Suppose Conditions 6.1 and 6.3 hold and, in addition,
$\Gamma$ is tight. Then there exists $\zeta^* \in \mathcal{Z}$ such that

(49) $\sup_{P \in \Gamma} L(P, \zeta^*) = \inf_{\zeta \in \mathcal{Z}} \sup_{P \in \Gamma} L(P, \zeta) = \sup_{P \in \Gamma} \inf_{a \in \mathcal{A}} L(P, a) = H^*$.

In particular, the game $\mathcal{G}^\Gamma$ has value $H^*$, and $\zeta^*$ is robust Bayes against $\Gamma$.

In the Appendix we prove the more general Theorem A.2, which implies
Theorem 6.4. We also prove Proposition A.1, which shows that (under some
restrictions) the conditions of Theorem A.2 are satisfied when $L$ is the
logarithmic score.

The theorems above supply sufficient conditions for the existence of a 
robust Bayes act, but do not give any further characterization of it, nor do 
they assist in finding it. In the next sections we shall consider the important 
special case of $\Gamma$ defined by linear constraints, for which we can develop
explicit characterizations. 

7. Mean-value constraints. Let $T = t(X)$, with $t : \mathcal{X} \to \mathbb{R}^k$, be a fixed
real- or vector-valued statistic. An important class of problems arises on
imposing mean-value constraints, where we take

(50) $\Gamma = \Gamma_\tau := \{P \in \mathcal{P} : E_P(T) = \tau\}$,

for some $\tau \in \mathbb{R}^k$. This is the type of constraint for which the maximum
entropy and minimum relative entropy principles have been most studied
[Jaynes (1957a, b) and Csiszár (1975)].

We denote the associated restricted game $(\Gamma_\tau, \mathcal{A}, L)$ by $\mathcal{G}^\tau$. We call $T$ the
generating statistic.

In some problems of this type (e.g., with logarithmic score on a continuous
sample space), the family $\Gamma_\tau$ will be so large that the conditions of the
theorems of Section 6 will not hold. Nevertheless, the special linear structure
will often allow other arguments for showing the existence of a saddle-point.

7.1. Duality. Before continuing our study of saddle-points, we note some 
simple duality properties of such mean-value problems. 

Definition 7.1. The specific entropy function $h : \mathbb{R}^k \to [-\infty, \infty]$
(associated with the loss function $L$ and generating statistic $T$) is defined by

(51) $h(\tau) := \sup_{P \in \Gamma_\tau} H(P)$.

In particular, if $\Gamma_\tau = \emptyset$, then $h(\tau) = -\infty$.



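Now define $\mathcal{T} := \{\tau \in \mathbb{R}^k : h(\tau) > -\infty\}$ and $\mathcal{P}^* := \{P \in \mathcal{P} : E_P(T) \in \mathcal{T}\}$.

Lemma 7.1. The set $\mathcal{T} \subseteq \mathbb{R}^k$ is convex, and the function $h$ is concave
on $\mathcal{T}$.

Proof. Take $\tau_0, \tau_1 \in \mathcal{T}$ and $\lambda \in (0, 1)$, and let $\tau_\lambda := (1 - \lambda)\tau_0 + \lambda\tau_1$.
There exist $P_0, P_1 \in \mathcal{P}$ with $P_i \in \Gamma_{\tau_i}$ and $H(P_i) > -\infty$, $i = 0, 1$. Let
$P_\lambda := (1 - \lambda)P_0 + \lambda P_1$. Then, for any $a \in \mathcal{A}$, $L(P_i, a) \ge H(P_i) > -\infty$, so that
$L(P_\lambda, a) = (1 - \lambda)L(P_0, a) + \lambda L(P_1, a)$ is defined, that is, $P_\lambda \in \mathcal{P}$. Moreover,
clearly $P_\lambda \in \Gamma_{\tau_\lambda}$. We thus have
$h(\tau_\lambda) \ge H(P_\lambda) \ge (1 - \lambda)H(P_0) + \lambda H(P_1) > -\infty$.
Thus $\tau_\lambda \in \mathcal{T}$; that is, $\mathcal{T}$ is convex. Now letting $P_0$ and $P_1$ vary
independently, we obtain $h(\tau_\lambda) \ge (1 - \lambda)h(\tau_0) + \lambda h(\tau_1)$; that is, $h$ is concave. □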

For $\tau \in \mathcal{T}$, define

(52) $P_\tau := \arg\sup_{P \in \Gamma_\tau} H(P)$

whenever this supremum is finite and is attained. It is allowed that $P_\tau$ is
not unique, in which case we consider an arbitrary such maximizer. Then
$H(P_\tau) = h(\tau)$. By Theorem 4.1(iv), (52) will hold if $(P_\tau, \zeta_\tau)$ is a saddle-point
in $\mathcal{G}^\tau$.

Dually, for $\beta \in \mathbb{R}^k$, we introduce

(53) $Q_\beta := \arg\sup_{P \in \mathcal{P}^*} \{H(P) - \beta^{\top} E_P(T)\}$,

whenever this supremum is finite and is attained. Again, $Q_\beta$ is not
necessarily unique. For any such $Q_\beta$ we can define a corresponding value of $\tau$
by

(54) $\tau = E_{Q_\beta}(T)$.

Then $Q_\beta \in \Gamma_\tau$, and on restricting the supremum in (53) to $P \in \Gamma_\tau$, we see
that we can take $Q_\beta$ for $P_\tau$ in (52). More generally, we write $\tau \leftrightarrow \beta$ whenever
there is a common distribution that can serve as both $P_\tau$ in (52) and $Q_\beta$
in (53) (in cases of nonuniqueness this correspondence may not define a
function in either direction).

It follows easily from (53) that, when $\tau \leftrightarrow \beta$,

(55) $h(\sigma) - \beta^{\top}\sigma \le h(\tau) - \beta^{\top}\tau$,

or equivalently

(56) $h(\sigma) \le h(\tau) + \beta^{\top}(\sigma - \tau)$

for all $\sigma \in \mathcal{T}$. Equation (56) expresses the fact that the hyperplane through
the point $(\tau, h(\tau))$ with slope coefficients $\beta$ is a supporting hyperplane to
the concave function $h : \mathcal{T} \to \mathbb{R}$. Thus $\tau$ and $\beta$ can be regarded as dual
coordinates for the specific entropy function. In particular, if $\tau \leftrightarrow \beta$ and $h$
is differentiable at $\tau$, we must have

(57) $\beta = h'(\tau)$.

More generally, if $\tau_1 \leftrightarrow \beta_1$ and $\tau_2 \leftrightarrow \beta_2$, then on combining two
applications of (55) we readily obtain

(58) $(\tau_2 - \tau_1)^{\top}(\beta_2 - \beta_1) \le 0$.

In particular, when $k = 1$ the correspondence $\tau \leftrightarrow \beta$ is nonincreasing in
the sense that $\tau_2 > \tau_1 \Rightarrow \beta_2 \le \beta_1$.
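
The relations (57) and (58) can be illustrated for the logarithmic score on
the hypothetical sample space $\{0, 1, 2\}$ with $t(x) = x$. The sketch assumes the
classical fact that the Shannon entropy maximizer under a mean constraint has
the form $p(x) \propto \exp\{-\beta x\}$, so that $-\log p(x)$ is linear in $t(x)$ with slope
$\beta$; the bisection solver and all names are illustrative.

```python
import math

XS = [0, 1, 2]  # sample space, with statistic t(x) = x

def gibbs(beta):
    """Distribution with p(x) proportional to exp(-beta * x)."""
    w = [math.exp(-beta * x) for x in XS]
    z = sum(w)
    return [v / z for v in w]

def mean(p):
    return sum(x * px for x, px in zip(XS, p))

def shannon(p):
    return -sum(px * math.log(px) for px in p if px > 0)

def beta_for(tau, lo=-50.0, hi=50.0):
    """Solve mean(gibbs(beta)) = tau by bisection; the mean is decreasing in beta."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if mean(gibbs(mid)) > tau:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def h(tau):
    """Specific entropy h(tau), assuming the maximizer has the form above."""
    return shannon(gibbs(beta_for(tau)))

taus = [0.5, 0.8, 1.0, 1.3, 1.6]
betas = [beta_for(t) for t in taus]
eps = 1e-5
slopes = [(h(t + eps) - h(t - eps)) / (2 * eps) for t in taus]  # numerical h'(tau)
```

Under these assumptions the central-difference slopes of $h$ agree with the
computed $\beta$ values, and $\beta$ decreases as $\tau$ increases, consistent with (57)
and (58).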

7.2. Linear loss condition. Theorem 7.1 gives a simple sufficient condition
for an act to be robust Bayes against $\Gamma_\tau$ of the form (50). We first
introduce the following definition.

Definition 7.2. An act $\zeta \in \mathcal{Z}$ is linear (with respect to loss function $L$
and statistic $T$) if, for some $\beta_0 \in \mathbb{R}$ and $\beta = (\beta_1, \ldots, \beta_k)^{\top} \in \mathbb{R}^k$ and all $x \in \mathcal{X}$,

(59) $L(x, \zeta) = \beta_0 + \beta^{\top} t(x)$.

A distribution $P \in \mathcal{P}$ is linear if it has a Bayes act $\zeta$ that is linear. In this
case we call $(P, \zeta)$ a linear pair. If $E_P(T) = \tau$ is finite, we then call $\tau$ a linear
point of $\mathcal{T}$. In all cases we call $(\beta_0, \beta)$ the associated linear coefficients.

Note that, if the problem is formulated in terms of a $\mathcal{Q}$-strictly proper
scoring rule $S$, and $P \in \mathcal{Q}$, the conditions "$P$ is a linear distribution," "$P$ is
a linear act" and "$(P, P)$ is a linear pair" are all equivalent, holding when
we have

(60) $S(x, P) = \beta_0 + \sum_{j=1}^{k} \beta_j t_j(x)$

for all $x \in \mathcal{X}$.

Theorem 7.1. Let $\tau \in \mathcal{T}$ be linear, with associated linear pair $(P_\tau, \zeta_\tau)$
and linear coefficients $(\beta_0, \beta)$. Let $\Gamma_\tau$ be given by (50). Then the following
hold:

(i) $\zeta_\tau$ is an equalizer rule against $\Gamma_\tau$.

(ii) $(P_\tau, \zeta_\tau)$ is a saddle-point in $\mathcal{G}^\tau$.

(iii) $\zeta_\tau$ is robust Bayes against $\Gamma_\tau$.

(iv) $h(\tau) = H(P_\tau) = \beta_0 + \beta^{\top}\tau$.

(v) $\tau \leftrightarrow \beta$.

Proof. For any $P \in \mathcal{P}^*$ we have

(61) $L(P, \zeta_\tau) = \beta_0 + \beta^{\top} E_P(T)$.

By (61) $L(P, \zeta_\tau) = \beta_0 + \beta^{\top}\tau = L(P_\tau, \zeta_\tau)$ for all $P \in \Gamma_\tau$. Thus (44)(b) holds
with equality, showing (i). Since $L(P_\tau, \zeta_\tau)$ is finite and $\zeta_\tau$ is Bayes against
$P_\tau$, (44)(a) holds. We have thus shown (ii). Then (iii) follows from
Theorem 4.1(v), and (iv) follows from Theorem 4.1(i), (iii) and (iv). For (v), we
have from (61) that, for $P \in \mathcal{P}^*$,

(62) $H(P) - \beta^{\top} E_P(T) \le L(P, \zeta_\tau) - \beta^{\top} E_P(T)$

(63) $= \beta_0$

(64) $= H(P_\tau) - \beta^{\top} E_{P_\tau}(T)$

from (iv). Thus we can take $Q_\beta$ in (53) to be $P_\tau$. □
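
Parts (i) and (iv) of Theorem 7.1 can be checked numerically for the
logarithmic score. The sketch below uses a hypothetical sample space
$\{0, 1, 2\}$ with $t(x) = x$ and an arbitrarily chosen $\beta = 0.7$; the distribution
$P_\tau$ is built so that $-\log p_\tau(x) = \beta_0 + \beta x$, making it a linear act in the
sense of (59). All names are illustrative.

```python
import math

XS = [0, 1, 2]             # sample space, statistic t(x) = x
beta = 0.7                 # hypothetical slope coefficient
weights = [math.exp(-beta * x) for x in XS]
beta0 = math.log(sum(weights))               # so that -log p_tau(x) = beta0 + beta * x
p_tau = [w / sum(weights) for w in weights]
tau = sum(x * p for x, p in zip(XS, p_tau))  # mean value fixed by the constraint (50)

def log_loss(p, q):
    """Expected logarithmic score S(P, Q) = -sum_x p(x) log q(x)."""
    return -sum(px * math.log(qx) for px, qx in zip(p, q) if px > 0)

def member(s):
    """A distribution on {0, 1, 2} with mean tau, indexed by s = p(2)."""
    return [1.0 - tau + s, tau - 2.0 * s, s]

# Sweep the whole one-parameter family of distributions with mean tau.
s_lo, s_hi = max(0.0, tau - 1.0), tau / 2.0
family = [member(s_lo + (s_hi - s_lo) * i / 20) for i in range(21)]
losses = [log_loss(p, p_tau) for p in family]   # constant: equalizer rule
entropies = [log_loss(p, p) for p in family]    # H(P) = S(P, P)
h_tau = beta0 + beta * tau
```

Every member of the family incurs the same expected loss $\beta_0 + \beta\tau$ against
$P_\tau$, and no member has entropy exceeding that value, in agreement with (i)
and (iv).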

Corollary 7.1. The same result holds if (59) is only required to hold
with probability 1 under every $P \in \Gamma_\tau$.

We now develop a partial converse to Theorem 7.1, giving a necessary 
condition for a saddle-point. This will be given in Theorem 7.2. 



Definition 7.3. A point $\tau \in \mathcal{T}$ is regular if there exists a saddle-point
$(P_\tau, \zeta_\tau)$ in $\mathcal{G}^\tau$, and there exists $\beta = (\beta_1, \ldots, \beta_k)^{\top} \in \mathbb{R}^k$ such that:

(i) $P_\tau$ can serve as $Q_\beta$ in (53) (so that $\tau \leftrightarrow \beta$).

(ii) With $\zeta = \zeta_\tau$ and (necessarily)

(65) $\beta_0 := h(\tau) - \beta^{\top}\tau$,

the linear loss property (59) holds with $P_\tau$-probability 1.

If $\tau$ satisfies the conditions of Theorem 7.1 or of Corollary 7.1 it will be
regular, but in general the force of the "almost sure" linearity requirement
in (ii) above is weaker than needed for Corollary 7.1.

We shall denote the set of regular points of $\mathcal{T}$ by $\mathcal{T}^r$, and its subset of
linear points by $\mathcal{T}^l$. For discrete $\mathcal{X}$, $\tau \in \mathcal{T}^r$ will by (ii) be linear whenever
$P_\tau$ gives positive probability to every $x \in \mathcal{X}$. More generally, as soon as we
know $\tau \in \mathcal{T}^r$, the following property, which follows trivially from (ii), can be
used to simplify the search for a saddle-point:

Lemma 7.2. If $\tau$ is regular, the support $\mathcal{X}_\tau$ of $P_\tau$ is such that, for some
$\zeta \in \mathcal{Z}$, $L(x, \zeta)$ is a linear function of $t(x)$ on $\mathcal{X}_\tau$.

The following lemma and corollary are equally trivial.

Lemma 7.3. Suppose $\tau \in \mathcal{T}^r$. If $P \in \Gamma_\tau$ and $P \ll P_\tau$, then $L(P, \zeta_\tau) =
h(\tau)$.

Corollary 7.2. If $\tau \in \mathcal{T}^r$ and $P \ll P_\tau$ for all $P \in \Gamma_\tau$, then $\zeta_\tau$ is an
equalizer rule in $\mathcal{G}^\tau$.

We now show that, under mild conditions, a point $\tau$ in the relative interior
[Rockafellar (1970), page 44] $\mathcal{T}^\circ$ of $\mathcal{T}$ will be regular. Fix $\tau \in \mathcal{T}^\circ$ and
consider $\Gamma_\tau$, given by (50). We shall suppose that there exists a saddle-point
$(P_\tau, \zeta_\tau)$ for the game $\mathcal{G}^\tau$; this could be established by the theory of
Section 5 or 6, for example. The value $L(P_\tau, \zeta_\tau)$ of the game will then be $h(\tau)$,
which will be finite.

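Consider the function $\psi_\tau$ on $\mathcal{T}$ defined by

(66) $\psi_\tau(\sigma) := \sup_{P \in \Gamma_\sigma} L(P, \zeta_\tau)$.

In particular, $\psi_\tau(\tau) = h(\tau)$.

Proposition 7.1. $\psi_\tau$ is finite and concave on $\mathcal{T}$.

Proof. For $\sigma \in \mathcal{T}$ there exists $P \in \Gamma_\sigma$ with $H(P) > -\infty$; so
$\psi_\tau(\sigma) \ge L(P, \zeta_\tau) \ge H(P) > -\infty$.

Now take $\sigma_0, \sigma_1 \in \mathcal{T}$ and $\lambda \in (0, 1)$, and consider $\sigma_\lambda := (1 - \lambda)\sigma_0 + \lambda\sigma_1$.
Then $\Gamma_{\sigma_\lambda} \supseteq \{(1 - \lambda)P_0 + \lambda P_1 : P_0 \in \Gamma_{\sigma_0}, P_1 \in \Gamma_{\sigma_1}\}$, so that
$\psi_\tau(\sigma_\lambda) \ge (1 - \lambda)\psi_\tau(\sigma_0) + \lambda\psi_\tau(\sigma_1)$. Thus $\psi_\tau$ is concave on $\mathcal{T}$.

Finally, if $\psi_\tau$ were to take the value $+\infty$ anywhere on $\mathcal{T}$, then by Lemma 4.2.6
of Stoer and Witzgall (1970) it would do so at $\tau \in \mathcal{T}^\circ$, which is impossible
since $\psi_\tau(\tau) = h(\tau)$ has been assumed finite. □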

For the proof of Theorem 7.2 we need to impose a condition allowing the 
passage from (70) to (71). For the examples considered in this paper, we can 
use the simplest such condition: 

Condition 7.1. For all $x \in \mathcal{X}$, $t(x) \in \mathcal{T}$.

This is equivalent to $t(\mathcal{X}) \subseteq \mathcal{T}$, or in turn to $\mathcal{T}$ being the convex hull
of $t(\mathcal{X})$. For other applications (e.g., involving unbounded loss functions on
continuous sample spaces) this may not hold, and then alternative conditions
may be more appropriate.

Theorem 7.2. Suppose that $\tau \in \mathcal{T}^0$ and $(P_\tau,\zeta_\tau)$ is a saddle-point for
the game $\mathcal{G}^\tau$. If Condition 7.1 holds, then $\tau$ is regular.

Proof. $\mathcal{T}$ is convex, $\psi_\tau : \mathcal{T} \to \mathbb{R}$ is concave, and $\tau \in \mathcal{T}^0$. The supporting
hyperplane theorem [Stoer and Witzgall (1970), Corollary 4.2.9] then
implies that there exists $\beta \in \mathbb{R}^k$ such that, for all $\sigma \in \mathcal{T}$,

(67) $\psi_\tau(\tau) + \beta^{\mathsf{T}}(\sigma - \tau) \ge \psi_\tau(\sigma)$.

That is, for any $P \in \mathcal{P}^*$,

(68) $h(\tau) + \beta^{\mathsf{T}}\{E_P(T) - \tau\} \ge \psi_\tau\{E_P(T)\}$.

However, for $P \in \mathcal{P}^*$,

(69) $\psi_\tau\{E_P(T)\} \ge L(P,\zeta_\tau) \ge \inf_{\zeta} L(P,\zeta) = H(P)$.

Thus, for all $P \in \mathcal{P}^*$,

$h(\tau) + \beta^{\mathsf{T}}\{E_P(T) - \tau\} \ge H(P)$,

with equality when $P = P_\tau$. This yields Definition 7.3(i).
For (ii), (68) and (69) imply that

(70) $h(\tau) - L(P,\zeta_\tau) + \beta^{\mathsf{T}}\{E_P(T) - \tau\} \ge 0$ for all $P \in \mathcal{P}^*$.

Take $x \in \mathcal{X}$, and let $P_x$ be the point mass on $x$. By Condition 7.1, $P_x \in \mathcal{P}^*$,
and so

(71) $h(\tau) - L(x,\zeta_\tau) + \beta^{\mathsf{T}}\{t(x) - \tau\} \ge 0$ for all $x \in \mathcal{X}$.

On the other hand,

(72) $E_{P_\tau}[h(\tau) - L(X,\zeta_\tau) + \beta^{\mathsf{T}}\{t(X) - \tau\}] = 0$.

Together (71) and (72) imply that

(73) $P_\tau[h(\tau) - L(X,\zeta_\tau) + \beta^{\mathsf{T}}\{t(X) - \tau\} = 0] = 1$.

The result follows. □

7.3. Exponential families. Here we relate the above theory to familiar 
properties of exponential families [Barndorff-Nielsen (1978)]. 

Let $\mu$ be a fixed $\sigma$-finite measure on a suitable $\sigma$-algebra in $\mathcal{X}$. The set
of all distributions $P \ll \mu$ having a $\mu$-density $p$ that can be expressed in the
form

(74) $p(x) = \exp\{\alpha_0 + \sum_{j=1}^{k} \alpha_j t_j(x)\}$

for all $x \in \mathcal{X}$ is the exponential family $\mathcal{E}$ generated by the base measure $\mu$
and the statistic $T$.

We remark that (74) is trivially equivalent to

(75) $S(x,p) = \beta_0 + \sum_{j=1}^{k} \beta_j t_j(x)$,

for all $x \in \mathcal{X}$, where $S$ is the logarithmic score (20), and $\beta_j = -\alpha_j$. In
particular, $(P,p)$ is a linear pair.

Now under regularity conditions on $\mu$ and $T$ [Barndorff-Nielsen (1978),
Chapter 9; see also Section 7.4.1 below], for all $\tau \in \mathcal{T}^0$ there will exist a
unique $P_\tau \in \Gamma_\tau \cap \mathcal{E}$; that is, $P_\tau$ has a density $p_\tau$ of the form (74), and
$E_{P_\tau}(T) = \tau$. Comparing (75) with (59), it follows from Theorem 7.1 that
(as already demonstrated in detail in Section 2.1) $(P_\tau,p_\tau)$ is a saddle-point
in $\mathcal{G}^\tau$. In particular, as is well known [Jaynes (1989)], the distribution $P_\tau$
will maximize the entropy (21), subject to the mean-value constraints (50).
However, we regard this property as less fundamental than the concomitant
dual property: that $p_\tau$ is the robust Bayes act under the logarithmic score
when all that we know of Nature's distribution $P$ is that it satisfies the
mean-value constraint $P \in \Gamma_\tau$. Furthermore, by Theorem 7.1(i), in this case
$p_\tau$ will be an equalizer strategy against $\Gamma_\tau$ [cf. (3)].

We remark that $p_\tau$ of the form (74) is only one version of the density
for $P_\tau$ with respect to $\mu$; any other such density can differ from $p_\tau$ on a
set of $\mu$-measure 0. However, our game requires DM to specify a density,
rather than a distribution, and from this point of view certain other versions
of the density of $P_\tau$ (which are of course still Bayes against $P_\tau$) will not
do: they are not robust Bayes. For example, let $\mathcal{X} = \mathbb{R}$, let $\mu$ = Lebesgue
measure and consider the constraints $E_P(X) = 0$, $E_P(X^2) = 1$. Let $P_0$ be
the standard Normal distribution $N(0,1)$, and let $p_0$ be its usual density
formula, $p_0(x) = (2\pi)^{-1/2}\exp(-\tfrac{1}{2}x^2)$. Then the conditions of Theorem 7.1
hold, $P_0$ is maximum entropy (as is well known) and the choice $p_0$ for its
density is robust Bayes against the set $\Gamma_0$ of all distributions $P$ (including,
importantly, discrete distributions) that satisfy the constraints. This would
not have been true if instead of $p_0$ we had taken $p_0'$, identical with $p_0$ except
for $p_0'(x) = p_0(x)/2$ at $x = \pm 1$. While $p_0'$ is still Bayes against $P_0$, its Bayes
loss against the distribution in $\Gamma_0$ that puts equal probability 1/2 at $-1$
and $+1$ exceeds the (constant) Bayes loss of $p_0$ by $\log 2$. Consequently, $p_0'$
is not a robust Bayes act. It is in fact easy to see that a density $p$ will be
robust Bayes in this problem if and only if $p(x) \ge p_0(x)$ everywhere (the set
on which strict inequality holds necessarily having Lebesgue measure 0).
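
As an illustrative numerical check of the Normal example above (a sketch only, not part of the original development), the following Python fragment evaluates the expected logarithmic score of the two density versions against discrete distributions with mean 0 and second moment 1. The function names and the particular two-point and three-point distributions are our own hypothetical choices.

```python
import math

def log_loss_p0(x):
    """Logarithmic score -log p0(x) of the usual N(0, 1) density formula."""
    return 0.5 * math.log(2.0 * math.pi) + 0.5 * x * x

def log_loss_p0_prime(x):
    """Score of the altered version p0', equal to p0(x)/2 at x = -1 and x = +1."""
    extra = math.log(2.0) if abs(x) == 1.0 else 0.0
    return log_loss_p0(x) + extra

def expected_loss(loss, points, probs):
    """Expected loss of a density version against a discrete distribution."""
    return sum(w * loss(x) for x, w in zip(points, probs))

# Two discrete distributions with mean 0 and second moment 1 (illustrative choices).
two_point = ([-1.0, 1.0], [0.5, 0.5])
three_point = ([-2.0, 0.0, 2.0], [0.125, 0.75, 0.125])

# Constant loss of p0 on the constraint set: (1/2) log(2 pi) + (1/2) E(X^2).
constant_loss = 0.5 * math.log(2.0 * math.pi) + 0.5
loss_p0_two = expected_loss(log_loss_p0, *two_point)
loss_p0_three = expected_loss(log_loss_p0, *three_point)
loss_p0_prime_two = expected_loss(log_loss_p0_prime, *two_point)

print(loss_p0_two, loss_p0_three, constant_loss)
print(loss_p0_prime_two - loss_p0_two, math.log(2.0))
```

Under these assumptions the loss of $p_0$ is the same for both discrete distributions, whereas the altered version pays the extra $\log 2$ against the two-point distribution on $\pm 1$.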

We further remark that none of the theorems of Section 6 applies to the 
above problem. The boundedness and weak closure requirements of Theo- 
rem 6.1 both fail; condition (ii) of Theorem 6.2 fails; and although Condi- 
tion 6.2 holds, the existence of a Bayes act and finite entropy required for 
Condition 6.3 fail for those distributions in $\Gamma_\tau$ having a discrete component.

7.4. Generalized exponential families. We now show how our game-theoretic
approach supports the extension of many of the concepts and properties of 
standard exponential family theory to apply to what we shall term a general- 
ized exponential family, specifically tailored to the relevant decision problem. 
Although the link to exponentiation has now vanished, analogues of familiar 
duality properties of exponential families [Barndorff-Nielsen (1978), Chap- 
ter 9] can be based on the theory of Section 7.1. 

Consider the following condition. 

Condition 7.2. For all $\tau \in \mathcal{T}$, $h(\tau) = \sup_{P \in \Gamma_\tau} H(P)$ is finite and is
achieved for a unique $P_\tau \in \Gamma_\tau$.

In particular, this will hold if (i) $\mathcal{X}$ is finite, (ii) $L$ is bounded and (iii)
$H$ is strictly concave. For under (i) and (ii) Theorem 5.1 guarantees that a
maximum generalized entropy distribution $P_\tau$ exists, which must then be
unique by (iii).

Under Condition 7.2 we can introduce the following parametric family of 
distributions over X: 

(76) $\mathcal{E}^m := \{P_\tau : \tau \in \mathcal{T}\}$.

We call $\mathcal{E}^m$ the full generalized exponential family generated by $L$ and $T$;
and we call $\tau$ its mean-value parameter. Condition 7.2 ensures that the map
$\tau \mapsto P_\tau$ is one-to-one.

Alternatively, consider the following condition: 

Condition 7.3. For all $\beta \in \mathbb{R}^k$, $\sup_{P \in \mathcal{P}^*}\{H(P) - \beta^{\mathsf{T}}E_P(T)\}$ is finite
and is achieved for a unique distribution $Q_\beta \in \mathcal{P}^*$.

Again, this will hold if, in particular, (i)-(iii) below Condition 7.2 are 
satisfied. 

Under Condition 7.3 we can introduce the parametric family 

(77) $\mathcal{E}^n := \{Q_\beta : \beta \in \mathbb{R}^k\}$.

We call this family the natural generalized exponential family generated by
the loss function $L$ and statistic $T$; we call $\beta$ its natural parameter. This
definition extends a construction of Lafferty (1999) based on Bregman
divergence: see Section 8.4.2. Note that in general the natural parameter $\beta$ in
$\mathcal{E}^n$ need not be identified; that is, the map $\beta \mapsto Q_\beta$ may not be one-to-one.
See, however, Proposition 7.2, which sets limits to this nonidentifiability.

From this point on, we suppose that both Conditions 7.2 and 7.3 are
satisfied. For any $\beta \in \mathbb{R}^k$, (54) yields $\tau \in \mathcal{T}$ with $\tau \leftrightarrow \beta$, that is, $P_\tau = Q_\beta$. It
follows that $\mathcal{E}^n \subseteq \mathcal{E}^m$.

We further define $\mathcal{E}^r := \{P_\tau : \tau \in \mathcal{T}^r\}$, the regular generalized exponential
family, and $\mathcal{E}^l := \{P_\tau : \tau \in \mathcal{T}^l\}$, the linear generalized exponential family,
generated by $L$ and $T$. Then $\mathcal{E}^l \subseteq \mathcal{E}^r \subseteq \mathcal{E}^m$. In general, $\mathcal{E}^l$ may be a proper
subset of $\mathcal{E}^r$: then for $P_\tau \in \mathcal{E}^r \setminus \mathcal{E}^l$ we can only assert the "almost sure linear
loss" property of Lemma 7.2.

The following result follows immediately from Definition 7.3(ii). 

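Proposition 7.2. If $Q_{\beta_1} = Q_{\beta_2} = Q \in \mathcal{E}^r$, then $(\beta_1 - \beta_2)^{\mathsf{T}}T = 0$ almost
surely under $Q$.

For $\tau \in \mathcal{T}^r$ choose $\beta$ as in Definition 7.3. Then $\tau \leftrightarrow \beta$, and it follows
that $\mathcal{E}^r \subseteq \mathcal{E}^n$. We have thus demonstrated the following.

Proposition 7.3. When Conditions 7.2 and 7.3 both apply,

$\mathcal{E}^r \subseteq \mathcal{E}^n \subseteq \mathcal{E}^m$.

Now consider $\mathcal{E}^0 := \{P_\tau : \tau \in \mathcal{T}^0\}$, the open generalized exponential family
generated by $L$ and $T$. From Theorem 7.2 we have the following:

Proposition 7.4. Suppose Conditions 7.1-7.3 all apply and a saddle-point
exists in $\mathcal{G}^\tau$ for all $\tau \in \mathcal{T}^0$. Then

(78) $\mathcal{E}^0 \subseteq \mathcal{E}^r \subseteq \mathcal{E}^n \subseteq \mathcal{E}^m$.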

7.4.1. Application to standard exponential families. We now consider 
more closely the relationship between the above theory and standard ex- 
ponential family theory. 

Let $\mathcal{E}^*$ be the standard exponential family (74) generated by some base
measure $\mu$ and statistic $T$. Taking as our loss function the logarithmic
score $S$, (75) shows that $\mathcal{E}^l \subseteq \mathcal{E}^*$ (distributions in $\mathcal{E}^* \setminus \mathcal{E}^l$ being those for
which the expectation of $T$ does not exist). We can further ask: What is
the relationship between $\mathcal{E}^*$ and $\mathcal{E}^n$? As a partial answer to this, we give
sufficient conditions for $\mathcal{E}^*$, $\mathcal{E}^l$ and $\mathcal{E}^n$ to coincide.

For $\beta = (\beta_1,\dots,\beta_k) \in \mathbb{R}^k$, define

(79) $\kappa(\beta) := \log \int e^{-\beta^{\mathsf{T}}t(x)}\,d\mu$,

(80) $\chi(\beta) := \sup_{P \in \mathcal{P}^*}\{H(P) - \beta^{\mathsf{T}}E_P(T)\}$.

Let $B$ denote the convex set $\{\beta \in \mathbb{R}^k : \kappa(\beta) < \infty\}$, and let $B^0$ denote its
relative interior. For $\beta \in B$, let $Q^*_\beta$ be the distribution in $\mathcal{E}^*$ with $\mu$-density
$q^*_\beta(x) := \exp\{-\kappa(\beta) - \beta^{\mathsf{T}}t(x)\}$, and let $Q_\beta$, if it exists, achieve the supremum
in (80).

Proposition 7.5. (i) For all $\beta \in B^0$, the act $q^*_\beta$ is linear, and $Q_\beta = Q^*_\beta$
uniquely. Moreover, $\chi(\beta) = \kappa(\beta)$.

(ii) If $B = \mathbb{R}^k$, then Condition 7.3 holds and $\mathcal{E}^* = \mathcal{E}^l = \mathcal{E}^n$.

(iii) If Condition 7.3 holds, $B$ is nonempty and $\mathcal{E}^*$ is minimal and steep,
then $B = \mathbb{R}^k$ and $\mathcal{E}^* = \mathcal{E}^l = \mathcal{E}^n$.

[Note that the condition for (ii) will apply whenever the sample space $\mathcal{X}$
is finite.]

Proof of Proposition 7.5. Linearity of the act $q^*_\beta$ ($\beta \in B$) is immediate,
the associated linear coefficients being $(\beta_0,\beta)$ with $\beta_0 = \kappa(\beta)$. Suppose
$\beta \in B^0$. Then $\tau := E_{Q^*_\beta}(T)$ exists [Barndorff-Nielsen (1978), Theorem 8.1].

We may also write $P_\tau$ for $Q^*_\beta$. Then $\tau$ is a linear point, with $(P_\tau,p_\tau)$ the
associated linear pair. By Theorem 7.1(iv), $\kappa(\beta) = H(P_\tau) - \beta^{\mathsf{T}}\tau$. Also, by
Theorem 7.1(v) we can take $P_\tau = Q^*_\beta$ as $Q_\beta$. The supremum in (80) thus
being achieved by $P_\tau$, we have $\chi(\beta) = H(P_\tau) - \beta^{\mathsf{T}}\tau = \kappa(\beta)$.

To show that the supremum in (80) is achieved uniquely at $Q^*_\beta$, note that
any $P$ achieving this supremum must satisfy

(81) $H(P) - \beta^{\mathsf{T}}E_P(T) = H(Q^*_\beta) - \beta^{\mathsf{T}}E_{Q^*_\beta}(T) = \kappa(\beta) = S(P,q^*_\beta) - \beta^{\mathsf{T}}E_P(T)$,

the last equality deriving from the definition of $q^*_\beta$. It follows that $S(P,q^*_\beta) =
H(P) = S(P,p)$, whence $\int \log\{p(x)/q^*_\beta(x)\}\,p(x)\,d\mu = 0$. However, this can
only hold if $P = Q^*_\beta$.

Part (ii) follows immediately.

For part (iii), assume Condition 7.3 holds. Then, for all $\beta \in \mathbb{R}^k$,

(82) $\chi(\beta) = \sup_{\tau \in \mathcal{T}} \sup_{P \in \Gamma_\tau}\{H(P) - \beta^{\mathsf{T}}\tau\} = \sup_{\tau \in \mathcal{T}}\{h(\tau) - \beta^{\mathsf{T}}\tau\}$,

with $h(\tau)$ as in (51). By Lemma 7.1 $\mathcal{T}$ is convex. It follows that $\chi$ is a closed
convex function on $\mathbb{R}^k$.

Steepness of $\mathcal{E}^*$ means that $|\kappa(\beta_n)| \to \infty$ whenever $(\beta_n)$ is a sequence
in $B^0$ converging to a relative boundary point $\beta^*$ of $B$. Since $\kappa$ is convex
[Barndorff-Nielsen (1978), Chapter 8] and $\chi$ coincides with $\kappa$ on $B^0$, we
must thus have $|\chi(\beta_n)| \to \infty$ as $(\beta_n) \to \beta^*$. Since by Condition 7.3 the closed
convex function $\chi$ is finite on $\mathbb{R}^k$, $B$ cannot have any relative boundary
points (hence, under minimality, any boundary points) in $\mathbb{R}^k$. Since $B$ is
nonempty, it must thus coincide with $\mathbb{R}^k$. Then, by (ii), $\mathcal{E}^* = \mathcal{E}^l = \mathcal{E}^n$. □

To see that even under the above conditions we need not have $\mathcal{E}^* = \mathcal{E}^m$,
consider the case $\mathcal{X} = \{0,1\}$, $T = X$. Then $\mathcal{E}^m$ consists of all distributions
on $\mathcal{X}$, whereas $\mathcal{E}^* = \mathcal{E}^l = \mathcal{E}^n$ excludes the one-point distributions at 0 and 1.

7.4.2. Characterization of specific entropy. We now generalize a result
of Kivinen and Warmuth (1999). For the case of finite $\mathcal{X}$, they attack the
problem of minimizing the Kullback-Leibler discrepancy $KL(P,P_0)$ over all
$P$ such that $E_P(T) = 0$. Equivalently (see Section 3.5.2), they are maximizing
the entropy $H(P) = -KL(P,P_0)$, associated with the logarithmic score
relative to base measure $P_0$, subject to $P \in \Gamma_0$.

Let $\mathcal{E}^*$ be the standard exponential family (74) generated by base
measure $P_0$ and statistic $T$, with typical member $Q^*_\beta$ ($\beta \in \mathbb{R}^k$) having probability
mass function of the form

(83) $q^*_\beta(x) = p_0(x)\,e^{-\kappa(\beta) - \beta^{\mathsf{T}}t(x)}$

and entropy $h(\tau) = \kappa(\beta) + \beta^{\mathsf{T}}\tau$, where $\tau = E_{Q^*_\beta}(T)$.

Suppose $0 \in \mathcal{T}^0$. By Chapter 9 of Barndorff-Nielsen (1978), there then
exists within $\Gamma_0$ a unique member $Q^*_{\beta^*}$ of $\mathcal{E}^*$. By Theorem 7.1 the maximum
of the entropy $-KL(P,P_0)$ is achieved for $P = Q^*_{\beta^*}$; its maximized value is
thus $h(0) = \kappa(\beta^*)$, where

(84) $\kappa(\beta) = \log \sum_x p_0(x)\,e^{-\beta^{\mathsf{T}}t(x)}$.

Equation (1.5) of Kivinen and Warmuth (1999) essentially states that the
maximized entropy $h(0)$ over $\Gamma_0$ can equivalently be obtained as

(85) $h(0) = \min_{\beta \in \mathbb{R}^k} \kappa(\beta)$.

By Proposition 7.5(i) this can also be written as

(86) $h(0) = \min_{\beta \in \mathbb{R}^k} \chi(\beta)$.
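
The identity (85) can be checked numerically on a small example. The sketch below is ours and purely illustrative: the sample space $\{-1,0,1\}$ with $T = X$, the base distribution `P0`, and the ternary-search helper `argmin_convex` are all hypothetical choices, not taken from Kivinen and Warmuth (1999). It minimizes $\kappa$ of (84), forms the member (83) at the minimizer, and compares its negative Kullback-Leibler discrepancy with the minimized $\kappa$.

```python
import math

X = [-1, 0, 1]            # sample space, with statistic t(x) = x
P0 = [0.5, 0.3, 0.2]      # hypothetical base distribution

def kappa(beta):
    """kappa(beta) = log sum_x p0(x) exp(-beta * t(x)), as in (84)."""
    return math.log(sum(p * math.exp(-beta * x) for p, x in zip(P0, X)))

def argmin_convex(f, lo=-20.0, hi=20.0, iters=200):
    """Ternary search for the minimizer of a convex function on [lo, hi]."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if f(m1) < f(m2):
            hi = m2
        else:
            lo = m1
    return 0.5 * (lo + hi)

beta_star = argmin_convex(kappa)

# Member (83) of the exponential family at beta_star.
q_star = [p * math.exp(-kappa(beta_star) - beta_star * x) for p, x in zip(P0, X)]
mean_q = sum(q * x for q, x in zip(q_star, X))                   # should be close to 0
neg_kl = -sum(q * math.log(q / p) for q, p in zip(q_star, P0))   # entropy -KL(Q*, P0)

print(beta_star, mean_q)
print(neg_kl, kappa(beta_star))
```

In this example the minimizer of $\kappa$ yields a distribution satisfying the constraint $E(X) = 0$, and its entropy $-KL$ agrees with $\min_\beta \kappa(\beta)$ up to numerical error.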

We now extend the above property to a more general decision problem,
satisfying Conditions 7.2 and 7.3. Let $\tau \leftrightarrow \beta$, $\sigma \leftrightarrow \gamma$ ($\tau, \sigma \in \mathcal{T}$). Then $\chi(\beta) =
\beta_0 = h(\tau) - \beta^{\mathsf{T}}\tau$, $\chi(\gamma) = \gamma_0 = h(\sigma) - \gamma^{\mathsf{T}}\sigma$, with $\beta_0$, and correspondingly $\gamma_0$,
as in (65). From (56) we have

(87) $h(\sigma) \le \beta_0 + \beta^{\mathsf{T}}\sigma$.

Moreover, we have equality in (87) when $\beta = \gamma$. It follows that for $\sigma \in \mathcal{T}$

(88) $h(\sigma) = \inf_{\beta \in \mathbb{R}^k}\{\chi(\beta) + \beta^{\mathsf{T}}\sigma\}$,

the infimum being attained when $\beta \leftrightarrow \sigma$. In particular, when $0 \in \mathcal{T}$ we
recover (86) in this more general context. Equations (82) and (88) express
a conjugacy relation between the convex function $\chi$ and the concave
function $h$.

7.5. Support. Fix $x \in \mathcal{X}$. For any act $\zeta \in \mathcal{Z}$ we term the negative loss
$s_x(\zeta) := -L(x,\zeta)$ the support for act $\zeta$ based on data $x$. Likewise, $s_P(\zeta) := -L(P,\zeta)$
is the support for $\zeta$ based on a (theoretical or empirical) distribution $P$ for
$X$. If $\mathcal{F} \subseteq \mathcal{Z}$ is a family of contemplated acts, then the function $\zeta \mapsto s_P(\zeta)$ on
$\mathcal{F}$ is the support function over $\mathcal{F}$ based on "data" $P$. When the maximum of
$s_P(\zeta)$ over $\zeta \in \mathcal{F}$ is achieved at $\hat\zeta \in \mathcal{F}$ we may term $\hat\zeta$ the maximum support
act (in $\mathcal{F}$, based on $P$). Then $\hat\zeta$ is just the Bayes act against $P$ in the game
with loss function $L(x,\zeta)$, when $\zeta$ is restricted to the set $\mathcal{F}$.

For the special case of the logarithmic score (20), $s_x(q) = \log q(x)$ is the
log-likelihood of a tentative explanation $q(\cdot)$, on the basis of data $x$; if $P$ is
the empirical distribution formed from a sample of $n$ observations, $s_P(q)$ is
($n^{-1}$ times) the log-likelihood for the explanation whereby these were
independently and identically generated from density $q(\cdot)$. Thus our definition
of the support function generalizes that used in likelihood theory [Edwards
(1992)], while our definition of maximum support act generalizes that of
maximum likelihood estimate. In particular, maximum likelihood is Bayes
in the sense of the previous paragraph.

Typically we are only interested in differences of support (between dif- 
ferent acts, for fixed data x or distribution P), so that we can regard this 
function as defined only up to an additive constant; this is exactly anal- 
ogous to regarding a likelihood function as defined only up to a positive 
multiplicative constant. 
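
For the logarithmic score, the correspondence between support and likelihood can be illustrated with a few lines of Python. This is a minimal sketch under our own assumptions: the binary sample, the tentative explanation `q` and the grid of Bernoulli explanations are hypothetical, and the function name `support` is ours.

```python
import math
from collections import Counter

def support(dist, q):
    """Support s_P(q) = E_P[log q(X)] under the logarithmic score."""
    return sum(w * math.log(q[x]) for x, w in dist.items())

sample = [0, 1, 1, 0, 1]                       # hypothetical data
n = len(sample)
empirical = {x: c / n for x, c in Counter(sample).items()}

q = {0: 0.3, 1: 0.7}                           # hypothetical tentative explanation
log_likelihood = sum(math.log(q[x]) for x in sample)

# n times the support based on the empirical distribution is the log-likelihood.
print(n * support(empirical, q), log_likelihood)

# Maximum support act over a grid of Bernoulli explanations q_theta = (1 - theta, theta).
grid = [k / 100 for k in range(1, 100)]
best_theta = max(grid, key=lambda th: support(empirical, {0: 1.0 - th, 1: th}))
print(best_theta)
```

On this grid the maximum support act coincides with the maximum likelihood estimate, the sample proportion of ones.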

7.5.1. Maximum support in generalized exponential families. Let $T =
t(X)$ be a statistic, and let $\mathcal{E}^r$ be the regular generalized exponential family
generated by $L$ and $T$. Fix a distribution $P^*$ over $\mathcal{X}$, and consider the
associated support function $s^*(\cdot) := s_{P^*}(\cdot)$ over the family $\mathcal{F}^r := \{\zeta_\tau : \tau \in \mathcal{T}^r\}$.
It is well known [Barndorff-Nielsen (1978), Section 9.3] that, in the case
of an ordinary exponential family (when $L$ is logarithmic score and $\mathcal{F}^r =
\{p_\tau(\cdot) : \tau \in \mathcal{T}^r\}$ is the set of densities of distributions in $\mathcal{E}^r$), the likelihood
over $\mathcal{F}^r$ based on data $x^*$ (or more generally on a distribution $P^*$) is under
regularity conditions maximized at $p_{\tau^*}$, where $\tau^* = t(x^*)$ [or $\tau^* = E_{P^*}(T)$].
The following result gives a partial generalization of this property.

Theorem 7.3. Suppose $\tau^* := E_{P^*}(T) \in \mathcal{T}^r$. Let $\tau \in \mathcal{T}^r$ be such that
either of the following holds:

(i) $\zeta_\tau$ is linear;

(ii) $P^* \ll P_\tau$.

Then

(89) $s^*(\zeta_{\tau^*}) \ge s^*(\zeta_\tau)$.

Proof. Since $P^* \in \Gamma_{\tau^*}$ and $(P_{\tau^*},\zeta_{\tau^*})$ is a saddle-point in $\mathcal{G}^{\tau^*}$, we have

(90) $s^*(\zeta_{\tau^*}) \ge -h(\tau^*)$.

Under (i), (59) holds everywhere; under (ii), by Definition 7.3(ii) it holds
with $P^*$-probability 1. In either case we obtain

(91) $L(P^*,\zeta_\tau) = h(\tau) + \beta^{\mathsf{T}}(\tau^* - \tau)$.

By (56), the right-hand side is at least as large as $h(\tau^*)$, whence $s^*(\zeta_\tau) \le
-h(\tau^*)$. Combining this with (90), the result follows. □

Corollary 7.3. If for all $\tau \in \mathcal{T}^r$ either $\zeta_\tau$ is linear or $P^* \ll P_\tau$, then
$\zeta_{\tau^*}$ is the maximum support act in $\mathcal{F}^r$.

For the case of the logarithmic score (20) over a continuous sample space,
with $P^*$ a discrete distribution (e.g., the empirical distribution based on a
sample), Theorem 7.3(ii) may fail, and we need to apply (i). For this we must
be sure to take as the Bayes act $p(\cdot)$ against $P \in \mathcal{E}$ the specific choice where
(74) holds everywhere (rather than almost everywhere). Then Corollary 7.3
holds.

See Section 7.6.1 for a case where neither (i) nor (ii) of Theorem 7.3
applies, leading to failure of Corollary 7.3.

7.6. Examples. We shall now illustrate the above theory for the Brier
score, the logarithmic score and the zero-one loss. In particular we analyze
in detail the simple case having $\mathcal{X} = \{-1,0,1\}$ and $T = X$. For each decision
problem we (i) show how Theorems 7.1 and 7.2 can be used to find robust
Bayes acts, (ii) give the corresponding maximum entropy distributions and
(iii) exhibit the associated generalized exponential family and specific
entropy function.

7.6.1. Brier score. Consider the Brier score for $\mathcal{X} = \{x_1,\dots,x_N\}$. By
(17) we may write this score as

$S(x,Q) = 1 - 2q(x) + \sum_j q(j)^2$.

To try to apply Theorem 7.1 we search for a linear distribution $P_\tau \in \Gamma_\tau$.
That is, we must find $(\beta_j)$ such that, for all $x \in \mathcal{X}$,

(92) $1 - 2p_\tau(x) + \sum_y p_\tau(y)^2 = \beta_0 + \sum_{j=1}^{k} \beta_j t_j(x)$.

Equivalently, we must find $(\alpha_j)$ such that, for all $x$,

(93) $p_\tau(x) = \alpha_0 + \sum_{j=1}^{k} \alpha_j t_j(x)$.

The mean-value constraints

$\sum_x t_j(x)\,p_\tau(x) = \tau_j$, $j = 1,\dots,k$,

together with the normalization constraint

$\sum_x p_\tau(x) = 1$

will typically determine a unique solution for the $k + 1$ coefficients $(\alpha_j)$
in (93). As long as this procedure leads to a nonnegative value for each
$p_\tau(x)$, by Theorem 7.1 and the fact that the Brier score is proper we shall
then have obtained a saddle-point $(P_\tau,P_\tau)$.

However, as we shall see below, for certain values of $\tau$ this putative
"solution" for $P_\tau$ might have some $p_\tau(x)$ negative, showing that it is simply
not possible to satisfy (92). By Theorem 5.2 we know that, even in this case,
a saddle-point $(P_\tau,P_\tau)$ exists. We can find it by applying Theorem 7.2: we
first restrict the sample space to some $\mathcal{X}^* \subset \mathcal{X}$ and try to find a probability
distribution $P_\tau$ satisfying the mean-value and normalization constraints,
such that $p_\tau(x) = 0$ for $x \notin \mathcal{X}^*$ and for which, for some $(\beta_j)$, (92) holds for
all $x \in \mathcal{X}^*$ [or, equivalently, for some $(\alpha_j)$, (93) holds for all $x \in \mathcal{X}^*$]. Among
all such restrictions $\mathcal{X}^*$ that lead to an everywhere nonnegative solution for
$(p_\tau(x))$, we choose that yielding the largest value of $H$. Then the resulting
distribution $P_\tau$ will supply a saddle-point and so, simultaneously, (i) will
have $H(P_\tau) = h(\tau)$, the maximum possible generalized entropy $1 - \sum_x p(x)^2$
subject to the mean-value constraints, and (ii) (which we regard as more
important) will be robust Bayes for the Brier score against all distributions
satisfying that constraint.
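
The restricted-support search just described can be sketched in Python for the simple case $\mathcal{X} = \{-1,0,1\}$, $T = X$ announced at the start of Section 7.6. This is an illustrative implementation under our own assumptions, not the authors' code: the function names are hypothetical, and a single statistic ($k = 1$) is assumed so that the linear form (93) has just two coefficients.

```python
from itertools import combinations

X = (-1, 0, 1)          # sample space of the simple case, with T = X

def brier_entropy(p):
    """Generalized entropy H(P) = 1 - sum_x p(x)^2 of the Brier score."""
    return 1.0 - sum(v * v for v in p.values())

def linear_candidate(support, tau, tol=1e-12):
    """Distribution p(x) = a0 + a1*x on `support` (zero elsewhere) with total
    mass 1 and mean tau; returns None if no such nonnegative p exists."""
    n = len(support)
    s1 = sum(support)
    s2 = sum(x * x for x in support)
    det = n * s2 - s1 * s1
    if abs(det) < tol:                  # one-point support: p is a point mass
        x = support[0]
        if abs(x - tau) > tol:
            return None
        p = {x: 1.0}
    else:                               # solve the 2x2 system for (a0, a1)
        a0 = (s2 - s1 * tau) / det
        a1 = (n * tau - s1) / det
        p = {x: a0 + a1 * x for x in support}
    if min(p.values()) < -tol:
        return None
    return {x: max(p.get(x, 0.0), 0.0) for x in X}

def brier_maxent(tau):
    """Try every restricted sample space and keep the feasible candidate
    with the largest generalized entropy."""
    best = None
    for size in (1, 2, 3):
        for support in combinations(X, size):
            p = linear_candidate(support, tau)
            if p is not None and (best is None or brier_entropy(p) > brier_entropy(best)):
                best = p
    return best

print(brier_maxent(0.5))    # interior case: linear on the whole sample space
print(brier_maxent(-0.8))   # full-space solution infeasible: restricted support
```

For a mean of $-0.8$ the search returns the distribution $(0.8, 0.2, 0)$, which is the act $\zeta_{\tau^*}$ quoted in the discussion following Example 7.1.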

A more intuitive and more efficient geometric variant of the above proce- 
dure will be given in Section 8. 

Example 7.1. Suppose $\mathcal{X} = \{-1,0,1\}$ and $T = X$. Consider the
constraint $E(X) = \tau$, for $\tau \in [-1,1]$. We first look for linear acts satisfying
(93). The mean-value constraint $\sum_x x\,p_\tau(x) = \tau$ and normalization
constraint $\sum_x p_\tau(x) = 1$ provide two independent linear equations for the
coefficients $(\alpha_0,\alpha_1)$ in (93), so uniquely determining $(\alpha_0,\alpha_1)$, and hence $p_\tau$.
We easily find $\alpha_0 = \tfrac{1}{3}$, $\alpha_1 = \tfrac{1}{2}\tau$ and thus $p_\tau(x) = \tfrac{1}{3} + \tfrac{1}{2}\tau x$ ($x = -1,0,1$)
(whence $\beta_1 = -\tau$, $\beta_0 = \tfrac{2}{3} + \tfrac{1}{2}\tau^2$). We thus obtain a nonnegative solution for
$(p_\tau(-1),p_\tau(0),p_\tau(1))$ only so long as $\tau \in [-2/3,2/3]$: in this and only this
case the act $p_\tau$ is linear. When $\tau$ falls outside this interval we can proceed by
trying the restricted sample spaces $\{-1\}$, $\{0\}$, $\{1\}$, $\{0,1\}$, $\{-1,0\}$, $\{-1,1\}$,
as indicated above. All in all, we find that the optimal distribution $P_\tau$ has
probabilities, entropy and $\beta$ satisfying Definition 7.3, as given in Table 1.

The family $\{P_\tau : -1 \le \tau \le 1\}$ constitutes the regular generalized exponential
family over $\mathcal{X}$ generated by the Brier score and the statistic $T = X$. The
location of this family in the probability simplex is depicted in Figure 2.

We note that $h(\tau) = \beta_0 + \beta_1\tau$ and $\beta_1 = h'(\tau)$ ($-1 < \tau < 1$). The function
$h(\tau)$ is plotted in Figure 3; Figure 4 shows the correspondence between $\beta_1$
and $\tau$.

By Theorem 7.1(i), the robust Bayes act $P_\tau$ will be an equalizer rule when
$\tau$ is linear, that is, for $\tau \in [-\tfrac{2}{3},\tfrac{2}{3}]$, and also (trivially) when $\tau = \pm 1$.

The above example demonstrates the need for condition (i) or (ii) in
Theorem 7.3 and Corollary 7.3: typically both these conditions fail here for
$\tau \notin [-\tfrac{2}{3},\tfrac{2}{3}]$. Thus let $P^*$ have probabilities $(p^*(-1),p^*(0),p^*(1)) = (0.9,0,0.1)$, so
that $\tau^* = E_{P^*}(X) = -0.8$ and $\zeta_{\tau^*} = (0.8,0.2,0)$. From (18) we find $s^*(\zeta_{\tau^*}) =
-0.24$. However, $\zeta_{\tau^*} = \zeta_{-0.8}$ is not the maximum support act in $\mathcal{F}^r$ in this
case: it can be checked that this is given by $\zeta_{-0.95} = (0.95,0.05,0)$, having
support $s^*(\zeta_\tau) = -0.195$.
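
The support values quoted above can be reproduced numerically. In the sketch below, which is ours and only illustrative, the closed forms in `robust_bayes_act` are our own reconstruction from the procedure of Example 7.1 (Table 1 is not reproduced in this text), and the grid search over $\tau$ is a hypothetical stand-in for an exact maximization.

```python
def robust_bayes_act(tau):
    """Robust Bayes act for the Brier score on (-1, 0, 1) under E(X) = tau.

    Closed forms reconstructed from the procedure of Example 7.1
    (an assumption of this sketch, not a quotation of Table 1).
    """
    if abs(tau) <= 2.0 / 3.0:
        return (1.0 / 3.0 - tau / 2.0, 1.0 / 3.0, 1.0 / 3.0 + tau / 2.0)
    if tau > 0:
        return (0.0, 1.0 - tau, tau)
    return (-tau, 1.0 + tau, 0.0)

def brier_score(i, q):
    """Brier score S(x, Q) = 1 - 2 q(x) + sum_y q(y)^2, with x given by index i."""
    return 1.0 - 2.0 * q[i] + sum(v * v for v in q)

def support(p_star, q):
    """Support s*(q) = -E_{P*} S(X, q)."""
    return -sum(w * brier_score(i, q) for i, w in enumerate(p_star))

p_star = (0.9, 0.0, 0.1)                 # probabilities of -1, 0, 1
s_at_tau_star = support(p_star, robust_bayes_act(-0.8))
s_at_best = support(p_star, robust_bayes_act(-0.95))

# Grid search for the maximum support act within the family.
grid = [k / 100 for k in range(-100, 101)]
best_tau = max(grid, key=lambda t: support(p_star, robust_bayes_act(t)))

print(s_at_tau_star, s_at_best, best_tau)
```

Under these assumptions the grid maximizer lies at $\tau = -0.95$ rather than at $\tau^* = -0.8$, in line with the failure of Corollary 7.3 described above.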

7.6.2. Log loss. We now specialize the analysis of Section 7.3 to the case
$\mathcal{X} = \{-1,0,1\}$, $T = X$, with $\mu$ counting measure.

For $\tau \in (-1,1)$, the maximum entropy distribution $P_\tau$ will have (robust
Bayes) probability mass function of the form $p_\tau(x) = \exp\{-(\beta_0 + \beta_1 x)\}$. That
is, the probability vector $p_\tau = (p_\tau(-1),p_\tau(0),p_\tau(1))$ will be of the form
$(p e^{\beta_1}, p, p e^{-\beta_1})$, subject to the normalization and mean-value constraints

(94) $p(1 + e^{\beta_1} + e^{-\beta_1}) = 1$,

(95) $p(e^{-\beta_1} - e^{\beta_1}) = \tau$,

which uniquely determine $p \in (0,1)$, $\beta_1 \in \mathbb{R}$. Then $h(\tau) = \beta_0 + \beta_1\tau$, where
$\beta_0 = -\log p$.
We thus have

(96) $p = (1 + e^{\beta_1} + e^{-\beta_1})^{-1}$,

(97) $\tau = p(e^{-\beta_1} - e^{\beta_1})$,

(98) $h = -\log p + \beta_1\tau$.

Table 1

Brier score: maximum entropy distributions

Fig. 2. Brier score, logarithmic score and zero-one loss: the probability simplex for
$\mathcal{X} = \{-1,0,1\}$, with entropy contours and generalized exponential family (maximum
entropy distributions for the constraint $E(X) = \tau$, $\tau \in [-1,1]$). The set of distributions
satisfying $E(X) = \tau$ corresponds to a vertical line intersecting the base at $\tau$; this is displayed
for $\tau = -0.25$ and $\tau = 0.75$. The intersection of the bold curve and the vertical line
corresponding to $\tau$ represents the maximum entropy distribution for constraint $E(X) = \tau$.

On varying $\beta_1$ in $(-\infty,\infty)$, we obtain the parametric curve $(\tau,h)$ displayed in
Figure 3; Figure 4 displays the correspondence between $\beta_1$ and $\tau$. It is readily
verified that $dh/d\tau = (dh/d\beta_1)/(d\tau/d\beta_1) = \beta_1$, in accordance with (57).
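
The relations (96)-(98) and the slope identity $dh/d\tau = \beta_1$ can be checked numerically. The following is a minimal sketch under our own assumptions: the bisection solver, its bracketing interval and the finite-difference step are hypothetical choices made for illustration.

```python
import math

def mean_given_beta(beta1):
    """Mean tau of the distribution (p e^{beta1}, p, p e^{-beta1}), from (96)-(97)."""
    p = 1.0 / (1.0 + math.exp(beta1) + math.exp(-beta1))
    return p * (math.exp(-beta1) - math.exp(beta1))

def solve_log_score(tau, lo=-50.0, hi=50.0, iters=200):
    """Bisection for beta1 given tau in (-1, 1); the mean is decreasing in beta1."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mean_given_beta(mid) > tau:
            lo = mid
        else:
            hi = mid
    beta1 = 0.5 * (lo + hi)
    p = 1.0 / (1.0 + math.exp(beta1) + math.exp(-beta1))
    h = -math.log(p) + beta1 * tau           # specific entropy, as in (98)
    return p, beta1, h

def numerical_slope(tau, eps=1e-5):
    """Central finite-difference approximation to dh/dtau."""
    return (solve_log_score(tau + eps)[2] - solve_log_score(tau - eps)[2]) / (2.0 * eps)

p, beta1, h = solve_log_score(0.5)
print(p, beta1, h)
print(numerical_slope(0.5), beta1)
```

At $\tau = 0.5$ the finite-difference slope of $h$ agrees with the solved $\beta_1$ to within the accuracy of the approximation.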

In the terminology of Section 7.4, the above family $\{P_\tau : \tau \in (-1,1)\}$
constitutes the natural exponential family associated with the logarithmic score
and the statistic $T$. It is also the usual exponential family for this problem.
However, the full exponential family further includes $\tau = \pm 1$. The family
$\Gamma_1$ consists of the single distribution $P_1$ putting all its mass on the point 1.
Then trivially $P_1$ is maximum entropy [with specific entropy $h(1) = 0$], and
$p_1 = (0,0,1)$, with loss vector $L(\cdot,p_1) = (\infty,\infty,0)$, is unique Bayes against $P_1$
and robust Bayes against $\Gamma_1$. Clearly (59) fails in this case, but even though
$\tau = 1$ is not regular the property of Lemma 7.2 does hold there (albeit
trivially). Similar properties apply at $\tau = -1$.

Fig. 3. Specific entropy function $h(\tau)$ for Brier score, logarithmic score and zero-one
loss.

7.6.3. Zero-one loss. We now consider the zero-one loss (22) and seek
robust Bayes acts against mean-value constraints $\Gamma_\tau$ of form (76). Once
again we can try to apply Theorem 7.1 by looking for an act $\zeta_\tau \in \mathcal{Z}$ that is
Bayes against some $P_\tau \in \Gamma_\tau$, and such that

(99) $L(x,\zeta_\tau) = 1 - \zeta_\tau(x) = \beta_0 + \sum_{j=1}^{k} \beta_j t_j(x)$

for all $x \in \mathcal{X}$. When this proves impossible, we can again proceed by
restricting the sample space and using Theorem 7.2. The distribution $P_\tau$ will again
maximize the generalized entropy. However, in this problem, in contrast to
the log and Brier score cases, because of nonsemistrictness the Bayes act
against $P_\tau$ may be nonunique and, if we want to ensure that (99) (or its
restricted version) holds, it may matter which of the Bayes acts (including
randomized acts) we pick. Thus the familiar routine "maximize the
generalized entropy, and then use a Bayes act against this distribution" is not, by
itself, fully adequate to derive the robust Bayes act: additional care must be
taken to select the right Bayes act.

Fig. 4. Correspondence between mean-value parameter $\tau$ (x-axis) and natural
parameter $\beta_1$ (y-axis) of generalized exponential family, for Brier score, logarithmic score and
zero-one loss.

Example 7.2. Again take $\mathcal{X} = \{-1,0,1\}$ and $T = X$. Consider the
constraint $E(X) = \tau$, where $\tau \in [-1,1]$. We find that for each $\tau$ a unique
maximum entropy $P_\tau$ exists. By some algebra we can then find the probabilities
$(p_\tau(-1),p_\tau(0),p_\tau(1))$; they are given in Table 2, together with the
corresponding specific entropy $h(\tau)$ (also plotted in Figure 3).

The family of distributions $\{P_\tau : \tau \in [-1,1]\}$ thus constitutes the full
generalized exponential family over $\mathcal{X}$ generated by the zero-one loss and the
statistic $T = X$. The location of this family in the probability simplex is
depicted in Figure 2.

How can we determine the robust Bayes acts $\zeta_\tau$? We know that any such
$\zeta_\tau$ is Bayes against $P_\tau$ and thus puts all its mass on the modes of $P_\tau$. As
can be seen, for $-0.5 \le \tau \le 0.5$ the set $A_{P_\tau}$ of these modes has more than
one element. We additionally use (99), restricted to $x$ in the support of $P_\tau$,
to find out which $\zeta_\tau \in A_{P_\tau}$ are robust Bayes. For $\tau \in [-\tfrac{1}{2},\tfrac{1}{2}]$ this requires

(100) $-\beta_1 + \beta_0 = 1 - \zeta_\tau(-1)$,
      $\beta_0 = 1 - \zeta_\tau(0)$,
      $\beta_1 + \beta_0 = 1 - \zeta_\tau(1)$,

from which we readily deduce $\beta_0 = \tfrac{2}{3}$. The condition that $\zeta_\tau$ put all its mass
on the modes of $P_\tau$ then uniquely determines $\zeta_\tau$ for $-0.5 < \tau < 0$ and for
$0 < \tau < 0.5$. If $\tau = 0$, all acts $\zeta$ are Bayes for some $P \in \Gamma_\tau$ (take $P$ uniform),
and hence by Theorem 7.1 all solutions to (100) [i.e., such that $\zeta_\tau(0) = \tfrac{1}{3}$] are
robust Bayes acts. Finally, for $\tau = 0.5$ (the case $\tau = -0.5$ is similar) we must
have $\zeta_\tau(-1) = 0$, and we can use the "supporting hyperplane" property (56)
to deduce that $\zeta_\tau(0) \le \tfrac{1}{3}$.
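
The equalizer property of the act singled out by (100) can be illustrated numerically for a value of $\tau$ strictly between 0 and 0.5. The sketch below rests on our own assumptions: the value $\tau = 0.25$, the parametrization of the constraint set by $c = p(0)$, and the closed form used for `p_tau` are our derivations for illustration (Table 2 is not reproduced in this text).

```python
def zero_one_entropy(p):
    """Generalized entropy of the zero-one loss: H(P) = 1 - max_x p(x)."""
    return 1.0 - max(p)

def expected_loss(p, zeta):
    """Expected zero-one loss of the randomized act zeta against P."""
    return sum(px * (1.0 - zx) for px, zx in zip(p, zeta))

def member_of_gamma(tau, c):
    """Distribution on (-1, 0, 1) with mean tau and p(0) = c (needs c <= 1 - |tau|)."""
    return ((1.0 - c - tau) / 2.0, c, (1.0 - c + tau) / 2.0)

tau = 0.25
# Act with zeta(0) = 1/3, as required by (100), and no mass on the non-mode -1.
zeta = (0.0, 1.0 / 3.0, 2.0 / 3.0)
# Candidate maximum entropy distribution (our derivation, an assumption of this sketch).
p_tau = ((1.0 - 2.0 * tau) / 3.0, (1.0 + tau) / 3.0, (1.0 + tau) / 3.0)

members = [member_of_gamma(tau, k * (1.0 - abs(tau)) / 50) for k in range(51)]
losses = [expected_loss(p, zeta) for p in members]
entropies = [zero_one_entropy(p) for p in members]
value = (2.0 - tau) / 3.0     # beta_0 + beta_1 * tau with beta_0 = 2/3, beta_1 = -1/3

print(min(losses), max(losses), value)
print(max(entropies), zero_one_entropy(p_tau))
```

Every distribution on the grid incurs the same expected loss against this act, and none has generalized entropy above that common value, which is attained by `p_tau`.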

Table 3 gives the robust Bayes acts $\zeta_\tau$ for each $\tau \in [-1,1]$, together with
the corresponding values of $\beta_0, \beta_1$. Thus $\zeta_\tau$ is a linear act for $-0.5 \le \tau \le 0.5$
(where we must choose $a = \tfrac{1}{3}$ at the endpoints). Again we see that $h(\tau) =
\beta_0 + \beta_1\tau$, and that $\beta_1 = h'(\tau)$ where this exists.

Table 2

Zero-one loss: maximum entropy distributions

Figure 4 shows the relationship between $\beta_1$ and $\tau$. In this case the
uniqueness part of Condition 7.3 is not satisfied, with the consequence that neither
$\beta_1$ nor $\tau$ uniquely determines the other. However, the full exponential family
$\{P_\tau : -1 \le \tau \le 1\}$ is clearly specified by the one-one map $\tau \mapsto P_\tau$, and most
of the properties of such families remain valid.

8. Relative entropy, discrepancy, divergence. Analogous to our gener- 
alized definition of entropy, we now introduce generalized relative entropy 
with respect to a decision problem, and we show how the negative relative 
entropy has a natural interpretation as a measure of discrepancy. This allows 
us to extend our minimax results to a more general setting and leads to a 
generalization of the Pythagorean property of the relative Shannon entropy 
[Csiszar (1975)]. 

We first introduce the concept of the discrepancy between a distribution
$P$ and a (possibly randomized) act $\zeta$, induced by a decision problem.

8.1. Discrepancy. Suppose first that $H(P)$ is finite. We define, for any
$\zeta \in \mathcal{Z}$, the discrepancy $D(P,\zeta)$ between the distribution $P$ and the act $\zeta$ by

(101) $D(P,\zeta) := L(P,\zeta) - H(P)$.

In the general terminology of decision theory, $D(P,\zeta)$ measures DM's regret
[Berger (1985), Section 5.5.5] associated with taking action $\zeta$ when Nature
generates $X$ from $P$. Also, since $-D(P,\zeta)$ differs from $-L(P,\zeta)$ by a term
only involving $P$, we can use it in place of the support function $s_P(\zeta)$: thus
maximizing support is equivalent to minimizing discrepancy.
We note that, if a Bayes act $\zeta_P$ against $P$ exists, then

(102) $D(P,\zeta) = E_P\{L(X,\zeta) - L(X,\zeta_P)\}$.

Table 3

Zero-one loss: robust Bayes acts

We shall also use (102) as the definition of $D(P,\zeta)$ when $P \notin \mathcal{P}$, or $H(P)$
is not finite, but $P$ has a Bayes act (in which case it will not matter which
such Bayes act we choose). This definition can itself be generalized further
to take account of some cases where no Bayes act exists; we omit the details.
The function $D$ has the following properties:

(i) $D(P,\zeta) \in [0,\infty]$.

(ii) $D(P,\zeta) = 0$ if and only if $\zeta$ is Bayes against $P$.

(iii) For any $a, a' \in \mathcal{A}$, $D(P,a) - D(P,a')$ is linear in $P$ (in the sense of
Lemma 3.2).

(iv) $D$ is a convex function of $P$.

Conversely, under regularity conditions any function $D$ satisfying (i)-(iii)
above can be generated from a suitable decision problem by means of (101)
or (102) [Dawid (1998)].

8.1.1. Discrepancy and divergence. When our loss function is a $\mathcal{Q}$-proper
scoring rule $S$, we shall typically denote the corresponding discrepancy
function by $d$. Thus for $P, Q \in \mathcal{Q}$ with $H(P)$ finite,

(103) $d(P,Q) = S(P,Q) - H(P)$.

We now have $d(P,Q) \ge 0$, with equality when $Q = P$; if $S$ is $\mathcal{Q}$-strict,
then $d(P,Q) > 0$ for $Q \ne P$. Conversely, if for any scoring rule $S$, $S(P,Q) -
S(P,P)$ is nonnegative for all $P, Q \in \mathcal{Q}$, then the scoring rule $S$ is $\mathcal{Q}$-proper.
We refer to $d(P,Q)$ as the divergence between the distributions $P$ and $Q$.
As we shall see in Section 10, divergence can be regarded as analogous to a
measure of squared Euclidean distance.

The following lemma, generalizing Lemmas 4 and 7 of Topsøe (1979),
follows easily from (103) and the linearity of $S(P,Q)$ in $P$.

Lemma 8.1. Let $S$ be a proper scoring rule, with associated entropy
function $H$ and divergence function $d$. Let $P_1,\dots,P_n$ have finite entropies,
and let $(p_1,\dots,p_n)$ be a probability vector. Then

(104) $H(\bar P) = \sum_i p_i H(P_i) + \sum_i p_i\,d(P_i,\bar P)$,

(105) $d(\bar P,Q) = \sum_i p_i\,d(P_i,Q) - \sum_i p_i\,d(P_i,\bar P)$,

where $\bar P := \sum_i p_i P_i$.
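
Identities (104) and (105) are easy to check numerically for particular scoring rules. The sketch below does so for the Brier and logarithmic scores on a three-point sample space; the component distributions, mixing weights and comparison distribution `q` are hypothetical values chosen only for illustration, and the helper names are ours.

```python
import math

def brier(p, q):
    """Expected Brier score S(P, Q) = sum_x p(x) [1 - 2 q(x) + sum_y q(y)^2]."""
    sq = sum(v * v for v in q)
    return sum(px * (1.0 - 2.0 * qx + sq) for px, qx in zip(p, q))

def log_score(p, q):
    """Expected logarithmic score S(P, Q) = -sum_x p(x) log q(x)."""
    return -sum(px * math.log(qx) for px, qx in zip(p, q) if px > 0)

def entropy(score, p):
    """Generalized entropy H(P) = S(P, P) of a proper scoring rule."""
    return score(p, p)

def divergence(score, p, q):
    """Divergence d(P, Q) = S(P, Q) - H(P), as in (103)."""
    return score(p, q) - score(p, p)

def mixture(weights, dists):
    """Mixture distribution sum_i p_i P_i."""
    return tuple(sum(w * d[i] for w, d in zip(weights, dists))
                 for i in range(len(dists[0])))

dists = [(0.7, 0.2, 0.1), (0.1, 0.3, 0.6), (0.25, 0.5, 0.25)]   # illustrative P_i
weights = (0.5, 0.3, 0.2)                                        # illustrative p_i
q = (0.2, 0.5, 0.3)
p_bar = mixture(weights, dists)

def check_lemma(score):
    """Return both sides of (104) and of (105) for the given scoring rule."""
    lhs_104 = entropy(score, p_bar)
    rhs_104 = sum(w * (entropy(score, d) + divergence(score, d, p_bar))
                  for w, d in zip(weights, dists))
    lhs_105 = divergence(score, p_bar, q)
    rhs_105 = sum(w * (divergence(score, d, q) - divergence(score, d, p_bar))
                  for w, d in zip(weights, dists))
    return lhs_104, rhs_104, lhs_105, rhs_105

print(check_lemma(brier))
print(check_lemma(log_score))
```

For the Brier score the divergence computed this way reduces to the squared Euclidean distance $\sum_x \{p(x) - q(x)\}^2$, consistent with the analogy mentioned above.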

We can also associate a divergence with a more general decision problem,
with loss function $L$ such that $Z_Q$ is nonempty for all $Q \in \mathcal{Q}$, by

(106) $d(P,Q) := D(P,\zeta_Q) = E_P\{L(X,\zeta_Q) - L(X,\zeta_P)\}$,

where again for each $Q \in \mathcal{Q}$ we suppose we have selected some specific Bayes
act $\zeta_Q$. This will then be identical with the divergence associated directly
[using, e.g., (103)] with the corresponding scoring rule given by (15), and
(104) and (105) will continue to hold with this more general definition.

8.2. Relative loss. Given a game $\mathcal{G} = (\mathcal{X}, \mathcal{A}, L)$,
choose, once and for all, a reference act $\zeta_0 \in \mathcal{Z}$. We can
construct a new game $\mathcal{G}_0 = (\mathcal{X}, \mathcal{A}, L_0)$, where
the new loss function $L_0$ is given by

(107) $L_0(x,a) := L(x,a) - L(x,\zeta_0)$.

This extends naturally to randomized acts:
$L_0(x,\zeta) := L(x,\zeta) - L(x,\zeta_0)$. We call $L_0$ the relative loss
function and $\mathcal{G}_0$ the relative game with respect to the reference
act $\zeta_0$. In order that $L_0 > -\infty$ we shall require
$L(x,\zeta_0) < \infty$ for all $x \in \mathcal{X}$. We further restrict
attention to distributions in $\mathcal{P}' := \{P : L_0(P,a)$ is defined for
all $a \in \mathcal{A}\}$ and randomized acts in
$\mathcal{Z}' := \{\zeta : L_0(P,\zeta)$ is defined for all
$P \in \mathcal{P}'\}$. In general, $\mathcal{P}'$ and $\mathcal{Z}'$ may not
be identical with $\mathcal{P}$ and $\mathcal{Z}$. The expected relative loss
$L_0(P,\zeta)$ satisfies

(108) $L_0(P,\zeta) = L(P,\zeta) - L(P,\zeta_0)$

whenever $L(P,\zeta_0)$ is finite. Whether or not this is so, it is easily
seen that the Bayes acts against any $P$ are the same in both games.

Definition 8.1. An act $\zeta_0 \in \mathcal{Z}$ is called neutral if the
loss function $L(x,\zeta_0)$ is a finite constant, $k$ say, on $\mathcal{X}$.

If a neutral act exists, and we use it as our reference act, then
$L_0(P,\zeta) = L(P,\zeta) - k$, all $P \in \mathcal{P}$. The relative game
$\mathcal{G}_0$ is then effectively the same as the original game
$\mathcal{G}$, and maximum entropy distributions, saddle-points, and other
properties of the two games, or of their restricted subgames, will coincide.
However, these equivalences are typically not valid for more general
relative games.

8.3. Relative entropy. When a Bayes act $\zeta_P$ against $P$ exists, the
generalized relative entropy $H_0(P) := \inf_{a \in \mathcal{A}} L_0(P,a)$
associated with the relative loss $L_0$ is seen to be

(109) $H_0(P) = E_P\{L(X,\zeta_P) - L(X,\zeta_0)\}$.

[In particular, we must have $-\infty \le H_0(P) \le 0$.] When
$L(P,\zeta_0)$ is finite, this becomes

(110) $H_0(P) = H(P) - L(P,\zeta_0)$.

Comparing (109) with (102), we observe the simple but fundamental relation

(111) $H_0(P) = -D(P,\zeta_0)$.

The maximum generalized relative entropy criterion thus becomes identical
to the minimum discrepancy criterion:

Choose $P \in \Gamma$ to minimize, over $P \in \Gamma$, its discrepancy
$D(P,\zeta_0)$ from the reference act $\zeta_0$.

Note that, even though Bayes acts are unaffected by changing from $L$ to
the relative loss $L_0$, the corresponding entropy function (110) is not
unaffected. Thus in general the maximum entropy criterion (for the same
constraints) will deliver different solutions in the two problems. Related
to this, we can also expect to obtain different robust Bayes acts in the two
problems.

Suppose we construct the relative loss taking as our reference act
$\zeta_0$ a Bayes act against a fixed reference distribution $P_0$.
Alternatively, start with a proper scoring rule $S$, and construct directly
the relative score with reference to the act $P_0$. The minimum discrepancy
criterion then becomes the minimum divergence criterion: choose
$P \in \Gamma$ to minimize the divergence $d(P,P_0)$ from the reference
distribution $P_0$.

This reinterpretation can often assist in finding a maximum relative
entropy distribution. If moreover we can choose $P_0$ to be neutral, this
minimum divergence criterion becomes equivalent to maximizing entropy in the
original game.

8.4. Relative loss and generalized exponential families. 

8.4.1. Invariance relative to linear acts. Suppose the reference act
$\zeta_0$ is linear with respect to $L$ and $T$, so that we can write

(112) $L(x,\zeta_0) = \delta_0 + \delta^{\mathsf T} t(x)$.

Then if $E_P(T)$ exists,

(113) $L_0(P,\zeta) = L(P,\zeta) - \delta_0 - \delta^{\mathsf T} E_P(T)$,

(114) $H_0(P) = H(P) - \delta_0 - \delta^{\mathsf T} E_P(T)$.

In particular, for all $P \in \Gamma_\tau$,

(115) $L_0(P,\zeta) = L(P,\zeta) - \delta_0 - \delta^{\mathsf T} \tau$,

(116) $H_0(P) = H(P) - \delta_0 - \delta^{\mathsf T} \tau$.

We see immediately from the definitions that the full, the natural, the
regular and the linear generalized exponential families generated by $L_0$
and $T$ are identical with those generated by $L$ and $T$. The
correspondence $\tau \mapsto P_\tau$ is unaffected; for the natural case, if
$Q_\beta$ arises from $L$ and $Q_{0,\beta}$ from $L_0$, we have
$Q_{0,\beta} = Q_{\beta+\delta}$.

Suppose in particular that we take any $P_\alpha \in \mathcal{E}^l$. In this
case we can take $\zeta_0$ having property (112) to be the corresponding
Bayes act $\zeta_\alpha$. We thus see that a generalized exponential family
is unchanged when the loss function is redefined by taking it relative to
some linear member of the family. This property is well known for the case
of a standard exponential family, where every regular member is linear (with
respect to the logarithmic score). In that case, the relative loss can also
be interpreted as the logarithmic score when the base measure $\mu$ is
changed to $P_\alpha$; the exponential family is unchanged by such a choice.

8.4.2. Lafferty additive models. Lafferty (1999) defines the additive model
relative to a Bregman divergence $d$, reference measure $P_0$ and constraint
random variable $T : \mathcal{X} \to \mathbb{R}$ as the family of probability
measures $\{Q_\beta : \beta \in \mathbb{R}\}$ where

(117) $Q_\beta := \arg\min_{P \in \mathcal{P}} \beta E_P\{T(X)\} + d(P,P_0)$.

We note that $P_0 = Q_0$ is in this family.

Let $S$ be the Bregman score (29) associated with $d$ and let $S_0$ be the
associated relative score $S_0(x,Q) = S(x,Q) - S(x,P_0)$. Note that by (111)
$d(P,P_0) = -H_0(P)$, where $H_0(P)$ is the entropy associated with $S_0$.
Lafferty's additive models are thus special cases of our natural generalized
exponential families as defined in Section 7.4, being generated by the
specific loss function $S_0$ and statistic $T$. As shown in Section 8.4.1,
when $P_0$ is linear (with respect to $S$ and $T$) the previous sentence
remains true on replacing $S_0$ by $S$.

These considerations do not rely on any special Bregman properties, and 
so extend directly to any loss-based divergence function d of the form given 
by (103) or (106). 

8.5. Examples. 

8.5.1. Brier score. In the case of the Brier score, the divergence between
$P$ and $Q$ is given by the squared Euclidean distance between their
probability vectors:

(118) $d(P,Q) = \|p - q\|^2 = \sum_j \{p(j) - q(j)\}^2$.

Using a reference distribution $P_0$, the relative entropy thus becomes

(119) $H_0(P) = -\sum_j \{p(j) - p_0(j)\}^2$.

The uniform distribution over $\mathcal{X}$ is neutral. Therefore the
distribution within a set $\Gamma$ that maximizes the Brier entropy is just
that minimizing the discrepancy from the uniform reference distribution
$P_0$.

To see the consequences of this for the construction of generalized Brier
exponential families, let $\mathcal{X} = \{-1, 0, 1\}$ and consider the Brier
score picture in Figure 2. The bold line depicts the maximum entropy
distributions for constraints $E(T) = \tau$, $\tau \in [-1, 1]$. By the
preceding discussion, these coincide with the minimum $P_0$-discrepancy
distributions. For each fixed value of $\tau$, the set
$\Gamma_\tau = \{P : E_P(X) = \tau\}$ is represented by the vertical line
through the simplex intersecting the base line at the coordinate $\tau$. In
Figure 2 the cases $\tau = -0.25$ and $\tau = 0.75$ are shown explicitly. The
minimum discrepancy distribution within $\Gamma_\tau$ will be given by the
point on that line within the simplex that is nearest to the center of the
simplex. This gives us a simple geometric means to find the minimum relative
discrepancy distributions for $\tau \in [-1, 1]$, involving less work than
the procedure detailed in Section 7.6.1. We easily see that for
$\tau \in [-2/3, 2/3]$ the minimizing point $p_\tau$ is in the interior of
the line segment, while for $\tau$ outside this interval the minimizing point
is at one end of the segment.
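
The geometric description above can be turned into a closed form. The sketch below is our own working, under the assumption that probability vectors are ordered as $(p(-1), p(0), p(1))$; the function name is hypothetical. Minimizing $\sum_j \{p(j) - 1/3\}^2$ subject to $p(1) - p(-1) = \tau$ gives $(1/3 - \tau/2,\ 1/3,\ 1/3 + \tau/2)$ whenever this point lies in the simplex, that is, for $|\tau| \le 2/3$; otherwise the minimizer is the end of the segment at which the smaller of $p(-1)$, $p(1)$ vanishes.

```python
def min_brier_discrepancy(tau):
    """Point of {p : p(1) - p(-1) = tau} nearest the uniform distribution.

    Probabilities are ordered as (p(-1), p(0), p(1)).
    """
    if not -1.0 <= tau <= 1.0:
        raise ValueError("tau must lie in [-1, 1]")
    if abs(tau) <= 2.0 / 3.0:
        # Interior case: the unconstrained projection onto the line is a
        # valid probability vector.
        return (1.0 / 3.0 - tau / 2.0, 1.0 / 3.0, 1.0 / 3.0 + tau / 2.0)
    # Boundary case: the projection leaves the simplex, so the minimizer is
    # the end of the segment nearest the centre.
    if tau > 0:
        return (0.0, 1.0 - tau, tau)
    return (-tau, 1.0 + tau, 0.0)


# The two cases singled out in the text.
print(min_brier_discrepancy(-0.25))
print(min_brier_discrepancy(0.75))
```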

8.5.2. Logarithmic score. For $P \in \mathcal{M}$ (i.e., $P \ll \mu$) any
version $p$ of the density $dP/d\mu$ is Bayes against $P$. Then, with $q$ any
version of $dQ/d\mu$, $D(P,q) = E_P[\log\{p(X)/q(X)\}]$ is the
Kullback-Leibler divergence $KL(P,Q)$ and does not depend on the choice of
the versions of either $p$ or $q$. Again, for $P, Q \in \mathcal{M}$ we can
treat $S$ as a proper scoring rule $S(x,Q)$, with $d(P,Q) = KL(P,Q)$ as its
associated divergence. [For $P \notin \mathcal{M}$ there is no Bayes act (see
Section 3.5.2), and so, according to our definition (102), the discrepancy
$D(P,q)$ is not defined: we might define it as $+\infty$ in this case.]
Maximizing the relative entropy is thus equivalent to minimizing the
Kullback-Leibler divergence in this case.

There is a simple relationship between the choice of base measure $\mu$,
which is a necessary input to our specification of the decision problem, and
the use of a reference distribution for defining relative loss. If we had
constructed our logarithmic loss using densities starting with a different
choice $\mu_0$ of base measure, where $\mu_0$ is mutually absolutely
continuous with $\mu$, we should have obtained instead the loss function
$S_0(x,Q) = -\log q_0(x)$, with
$q_0(x) = (dQ/d\mu_0)(x) = (dQ/d\mu)(x) \times (d\mu/d\mu_0)(x)$. Thus
$S_0(x,Q) = S(x,Q) + k(x)$, with $k(x) = -\log d(x)$, where $d$ is some
version of $d\mu/d\mu_0$. In particular, when $\mu_0$ is a probability
measure, this is exactly the relative loss function (107) with respect to
the reference distribution $\mu_0$, when we start from the problem
constructed in terms of $\mu$ (in particular, it turns out that this
relative game will not depend on the starting measure $\mu$). As already
determined, the corresponding relative entropy function is
$H_0(P) = -KL(P,\mu_0)$.
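
On a finite sample space the change of base measure can be checked directly. The sketch below is a minimal illustration under assumed inputs: the measures `mu`, `mu_alt`, `mu0` and the distributions `p`, `q` are hypothetical, and measures are represented by their point masses. It compares the logarithmic score built on $\mu_0$ with the relative loss (107) built on two different starting measures, and then compares $H_0(P)$ with $-KL(P,\mu_0)$.

```python
import math


def log_score(x, q, base):
    """Logarithmic score -log (dQ/dbase)(x) on a finite sample space."""
    return -math.log(q[x] / base[x])


def relative_score(x, q, ref, base):
    """Relative loss (107): score of Q minus score of the reference act `ref`."""
    return log_score(x, q, base) - log_score(x, ref, base)


def kl(p, q):
    """Kullback-Leibler divergence KL(P, Q); terms with p(x) = 0 vanish."""
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)


mu = [1.0, 1.0, 1.0]      # counting measure as starting base measure
mu_alt = [0.5, 2.0, 4.0]  # another starting base measure, equivalent to mu
mu0 = [0.2, 0.5, 0.3]     # probability measure used as reference distribution
p = [0.6, 0.3, 0.1]
q = [0.1, 0.4, 0.5]

# S_0(x, Q) built directly on mu0, versus the relative loss built on mu or mu_alt.
direct = [log_score(x, q, mu0) for x in range(3)]
via_mu = [relative_score(x, q, mu0, mu) for x in range(3)]
via_mu_alt = [relative_score(x, q, mu0, mu_alt) for x in range(3)]

# The Bayes act against P is P itself, so H_0(P) = E_P S_0(X, P).
h0 = sum(p[x] * log_score(x, p, mu0) for x in range(3))

print(all(math.isclose(a, b) for a, b in zip(direct, via_mu)))
print(all(math.isclose(a, b) for a, b in zip(direct, via_mu_alt)))
print(math.isclose(h0, -kl(p, mu0)))
```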

8.5.3. Zero-one loss. In this case, the discrepancy between $P$ and an act
$\zeta \in \mathcal{Z}$ is given by

(120) $D(P,\zeta) = p_{\max} - \sum_j p(j)\,\zeta(j)$.

When $\mathcal{X}$ has finite cardinality $N$, and $\zeta_0$ is the
randomized act that chooses uniformly from $\mathcal{X}$, we have
$S(x,\zeta_0) = 1 - 1/N$, so that this choice of $\zeta_0$ is neutral.
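
A short computation makes (120) and the neutrality of the uniform act concrete. This is an illustrative sketch only: the function names and the distribution `p` are assumptions, with randomized acts represented as probability vectors over $\mathcal{X}$.

```python
def zero_one_loss(p, act):
    """Expected zero-one loss of a randomized act: 1 - sum_j p(j) * act(j)."""
    return 1.0 - sum(pj * aj for pj, aj in zip(p, act))


def zero_one_entropy(p):
    """Generalized entropy H(P) = 1 - p_max (the loss of guessing a mode of P)."""
    return 1.0 - max(p)


def discrepancy(p, act):
    """Discrepancy (120): D(P, act) = p_max - sum_j p(j) * act(j)."""
    return zero_one_loss(p, act) - zero_one_entropy(p)


n = 3
uniform_act = [1.0 / n] * n

# Loss of the uniform act at each outcome x, obtained from the point mass at x.
point_masses = [[1.0 if j == x else 0.0 for j in range(n)] for x in range(n)]
losses = [zero_one_loss(px, uniform_act) for px in point_masses]

# Relative entropy with respect to the uniform act, as in (110).
p = [0.5, 0.3, 0.2]
relative_entropy = zero_one_entropy(p) - zero_one_loss(p, uniform_act)

print(losses)
print(relative_entropy, -discrepancy(p, uniform_act))
```

The last line prints the two sides of (111) for this example.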

Take $\mathcal{X} = \{-1, 0, 1\}$ and $T = X$, let $\zeta_0$ be uniform on
$\mathcal{X}$ and consider the minimum zero-one $\zeta_0$-discrepancy
distributions shown in Figure 2. Determining this family of distributions
geometrically is easy once one has determined the contours of constant
generalized entropy, since these are also the contours of constant
discrepancy from $\zeta_0$.

8.5.4. Bregman divergence. In a finite sample space, the Bregman score 
(29) generates the Bregman divergence (30). Thus minimizing the Bregman 
divergence is equivalent to maximizing the associated relative entropy, which 
is in turn equivalent to finding a distribution that is robust Bayes against 
the associated relative loss function. Minimizing a Bregman divergence has 
become a popular tool in the construction and analysis of on-line learning 
algorithms [Lafferty (1999) and Azoury and Warmuth (2001)], on account 
of numerous pleasant properties it enjoys. As shown by properties (i)-(iv) 
of Section 8.1 and as will further be seen in Section 10, many of these 
properties generalize to an arbitrary decision-based divergence function as 
defined by (103) or (106). 

In more general sample spaces, the separable Bregman score (34) generates
the separable Bregman divergence $d_\psi$, given by (37). When the measure
$\mu$ appearing in these formulae is itself a probability distribution,
$\mu$ will be neutral (uniquely so if $\psi$ is strictly convex); then
minimizing over $P$ the separable Bregman divergence $d_\psi(P,\mu)$ of $P$
from $\mu$ becomes equivalent to maximizing the separable Bregman entropy
$H(P)$ as given by (38).

9. Statistical problems: discrepancy as loss. In this section we apply the 
general ideas presented so far to more specifically statistical problems. 

9.1. Parametric prediction problems. In a statistical decision problem,
we have a family $\{P_\omega : \omega \in \Omega\}$ of distributions for an
observable $X$ over $\mathcal{X}$, labelled by the values $\omega$ of a
parameter $\boldsymbol{\Omega}$ ranging over $\Omega$; the consequence of
taking an action $a$ depends on the value of $\boldsymbol{\Omega}$. We shall
show how one can construct a suitable loss function for this purpose,
starting from a general decision problem $\mathcal{G}$ with loss depending
on the value of $X$, and relate the minimax properties of the derived
statistical game $\overline{\mathcal{G}}$ to those of the underlying basic
game $\mathcal{G}$.

In our context $X$ is best thought of as a future outcome to be predicted,
perhaps after conducting a statistical experiment to learn about
$\boldsymbol{\Omega}$. The distributions of $X$ given
$\boldsymbol{\Omega} = \omega$ would often be taken to be the same as those
governing the data in the experiment, but this is not essential. Our
emphasis is thus on statistical models for prediction, rather than for
observed data: the latter will not enter directly. For applications of this
predictive approach to problems of experimental design, see Dawid (1998) and
Dawid and Sebastiani (1999).

9.2. Technical framework. Let $(\mathcal{X}, \mathcal{B})$ be a separable
metric space with its Borel $\sigma$-field, and let $\mathcal{P}_0$ be the
family of all probability distributions over $(\mathcal{X}, \mathcal{B})$.
We shall henceforth want to consider $\mathcal{P}_0$ itself (and subsets
thereof) as an abstract "parameter space." When we wish to emphasize this
point of view we shall denote $\mathcal{P}_0$ by $\Theta_0$, and its typical
member by $\theta$; when $\theta$ is considered in its original incarnation
as a probability distribution on $(\mathcal{X}, \mathcal{B})$, we may also
denote it by $P_\theta$.

$\Theta_0$ becomes a metric space under the Prohorov metric in
$\mathcal{P}_0$, and the associated topology is then identical with the weak
topology on $\mathcal{P}_0$ [Billingsley (1999), page 72]. We denote the set
of all probability distributions, or laws, on the Borel $\sigma$-field
$\mathcal{C}$ in $\Theta_0$ by $\mathcal{L}_0$. Such a law can be regarded as
a "prior distribution" for a parameter random variable
$\boldsymbol{\Theta}$ taking values in $\Theta_0$. For such a law
$\Pi \in \mathcal{L}_0$, we denote by $P_\Pi \in \mathcal{P}_0$ its mean,
given by $P_\Pi(A) = E_\Pi\{P_{\boldsymbol{\Theta}}(A)\}$
($A \in \mathcal{B}$): this is just the marginal "predictive" (mixture)
distribution for $X$ over $\mathcal{X}$, obtained by first generating a value
$\theta$ for $\boldsymbol{\Theta}$ from $\Pi$, and then generating $X$ from
$P_\theta$.

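9.3. The derived game. Starting from a basic game
$\mathcal{G} = (\mathcal{X}, \mathcal{A}, L)$, we construct a new derived
game, $\overline{\mathcal{G}} := (\Theta, \mathcal{A}, \overline{L})$. The
new loss function $\overline{L}$ on $\Theta \times \mathcal{A}$ is just the
discrepancy function for the original game $\mathcal{G}$,

(121) $\overline{L}(\theta, a) := D(P_\theta, a)$,

and the original sample space $\mathcal{X}$ is replaced by
$\Theta := \{\theta \in \Theta_0 : D(P_\theta, a)$ is defined for all
$a \in \mathcal{A}\}$.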
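We have

(122) $\overline{L}(\theta, a) = L(P_\theta, a) - H(P_\theta)$

when $H(P_\theta)$ is finite. Properties (121) and (122) then extend
directly to randomized acts $\zeta \in \mathcal{Z}$ for DM. A randomized act
for Nature in $\overline{\mathcal{G}}$ is a law $\Pi$ putting all its mass
on $\Theta \subseteq \Theta_0$. We shall denote the set of such laws by
$\mathcal{L} \subseteq \mathcal{L}_0$.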

Note that $\overline{L}(\theta, a)$ is just the regret associated with
taking action $a$ when $X \sim P_\theta$. It is nonnegative, and it vanishes
if and only if $a$ is Bayes against $P_\theta$. Such a regret function will
often be a natural loss function to use in a statistical decision problem.

Since $\overline{L} \ge 0$, the expected loss $\overline{L}(\Pi, \zeta)$ is
defined in $[0, \infty]$ for all $\Pi \in \mathcal{L}$,
$\zeta \in \mathcal{Z}$. From (122) we obtain

(123) $\overline{L}(\Pi, \zeta) = L(P_\Pi, \zeta) - \int H(P_\theta)\, d\Pi(\theta)$

when the integral exists. An act $\zeta_0$ will thus be Bayes against $\Pi$
in $\overline{\mathcal{G}}$ if and only if it is Bayes against $P_\Pi$ in
$\mathcal{G}$. More generally, this equivalence follows from the property
$E_\Pi\{\overline{L}(\boldsymbol{\Theta}, \zeta) - \overline{L}(\boldsymbol{\Theta}, \zeta_0)\} = E_{P_\Pi}\{L(X, \zeta) - L(X, \zeta_0)\}$.
In particular, if $L$ is a $\mathcal{Q}$-proper scoring rule in the basic
game $\mathcal{G}$, and the mixture distribution $P_\Pi \in \mathcal{Q}$,
then $P_\Pi$ will be Bayes against $\Pi$ in $\overline{\mathcal{G}}$.
The derived entropy function is

(124) $\overline{H}(\Pi) = H(P_\Pi) - \int H(P_\theta)\, d\Pi(\theta)$

(when the integral exists) and is nonnegative. This measures the expected
reduction in uncertainty about $X$ obtainable by learning the value of
$\boldsymbol{\Theta}$, when initially $\boldsymbol{\Theta} \sim \Pi$: it is
the expected value of information [DeGroot (1962)] in $\boldsymbol{\Theta}$
about $X$.

The derived discrepancy is just

(125) $\overline{D}(\Pi, \zeta) = D(P_\Pi, \zeta)$.
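
To see (123) to (125) at work in the simplest setting, the sketch below takes the Brier score as basic loss, a three-point sample space and a three-point parameter set with a discrete prior. All names and numbers are illustrative assumptions. It checks that the derived entropy equals the prior-expected regret of the predictive distribution $P_\Pi$, and that any other act incurs the additional amount $D(P_\Pi, \zeta)$.

```python
def brier_entropy(p):
    """Brier entropy H(P) = 1 - sum_j p(j)**2."""
    return 1.0 - sum(pj * pj for pj in p)


def brier_discrepancy(p, act):
    """D(P, Q) = ||p - q||**2 when the act is itself a distribution Q."""
    return sum((pj - aj) ** 2 for pj, aj in zip(p, act))


def predictive(prior, model):
    """Mixture distribution P_Pi = sum_theta Pi(theta) * P_theta."""
    return [sum(w * p[j] for w, p in zip(prior, model)) for j in range(len(model[0]))]


def derived_entropy(prior, model):
    """(124): H(P_Pi) minus the prior expectation of H(P_theta)."""
    mix = predictive(prior, model)
    return brier_entropy(mix) - sum(w * brier_entropy(p) for w, p in zip(prior, model))


def derived_loss(prior, model, act):
    """Prior expectation of the regret D(P_theta, act)."""
    return sum(w * brier_discrepancy(p, act) for w, p in zip(prior, model))


# Hypothetical model and prior.
model = [[0.8, 0.1, 0.1], [0.2, 0.6, 0.2], [0.1, 0.2, 0.7]]
prior = [0.5, 0.25, 0.25]
mix = predictive(prior, model)
other_act = [1.0 / 3.0] * 3

h_bar = derived_entropy(prior, model)
print(h_bar, derived_loss(prior, model, mix))
print(derived_loss(prior, model, other_act) - h_bar, brier_discrepancy(mix, other_act))
```

Each printed pair should agree up to rounding: the first reflects that $P_\Pi$ is Bayes against $\Pi$ in the derived game, the second is (125).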

9.4. A statistical model. Let $\Omega \subseteq \Theta$: for example,
$\Omega$ might be a parametric family of distributions for $X$. We can think
of $\Omega$ as the statistical model for the generation of $X$. We will
typically write $\omega$ or $P_\omega$ for a member of $\Omega$ and use
$\boldsymbol{\Omega}$ to denote the parameter when it is restricted to
taking values in $\Omega$. We denote by $\Lambda \subseteq \mathcal{L}_0$ the
class of laws on $\Theta_0$ that give all their mass to $\Omega$ and can thus
serve as priors for the parameter $\boldsymbol{\Omega}$ of the model; we
denote by $\Gamma \subseteq \mathcal{P}_0$ the family
$\{P_\Pi : \Pi \in \Lambda\}$ of all distributions for $X$ obtainable as
mixtures over the model $\Omega$. Clearly both $\Lambda$ and $\Gamma$ are
convex.

Lemma 9.1. Suppose that the family $\Omega$ of distributions on
$(\mathcal{X}, \mathcal{B})$ is tight. Then so too are $\Gamma$ and $\Lambda$
[the latter as a family of laws on $(\Theta_0, \mathcal{C})$].

Proof. The tightness of $\Gamma$ follows easily from the definition.

Let $\overline{\Omega}$ denote the closure of $\Omega$ in $\Theta_0$. Since
$\Omega$ is tight, so is $\overline{\Omega}$ [use, e.g., Theorem 3.1.5(iii)
of Stroock (1993)], and then Prohorov's theorem [Billingsley (1999),
Theorem 5.1] implies that $\overline{\Omega}$ is compact in the weak
topology. Any collection (in particular, $\Lambda$) of distributions on
$(\Theta_0, \mathcal{C})$ supported on $\overline{\Omega}$ is then
necessarily tight. □

9.5. Minimax properties. Now consider a statistical model $\Omega$ with
$\Omega \subseteq \Theta$ (so that $\Lambda \subseteq \mathcal{L}$). We can
tailor the derived game $\overline{\mathcal{G}}$ to this model by simply
restricting the domain of $\overline{L}$ to $\Omega \times \mathcal{A}$. We
would thus be measuring the loss (regret) of taking act
$\zeta \in \mathcal{Z}$, when the true parameter value is
$\omega \in \Omega$, by $\overline{L}(\omega, \zeta) = D(P_\omega, \zeta)$.
Alternatively, and equivalently, we can focus attention on the restricted
game $\overline{\mathcal{G}}^\Lambda$ as defined in Section 4.2, with
$\Lambda$ the family of laws supported on the model $\Omega$. In the present
context we shall denote this by $\overline{\mathcal{G}}^\Omega$.

We will often be interested in the existence and characterization of a
value, saddle-point, maximum entropy (maximin) prior $\Pi^*$ or robust Bayes
(minimax) act $\zeta^*$, in the game $\overline{\mathcal{G}}^\Omega$. Note in
particular that, when we do have a saddle-point $(\Pi^*, \zeta^*)$ in
$\overline{\mathcal{G}}^\Omega$ with value $\overline{H}^*$, we can use
Lemma 4.2 to deduce that $\Pi^*$ must put all its mass on
$\Upsilon := \{\omega \in \Omega : D(P_\omega, \zeta^*) = \overline{H}^*\}$:
in particular, with $\Pi^*$-prior probability 1 the discrepancy from the
minimax act is constant. When, as will typically hold, $\Upsilon$ is a proper
subset of $\Omega$, we further deduce from Corollary 4.4 that $\zeta^*$ is
not an equalizer rule in $\overline{\mathcal{G}}^\Omega$.

To investigate further the minimax and related properties of the game
$\overline{\mathcal{G}}^\Omega$, we could try to verify directly for this
game the requirements of the general theorems already proved in
Sections 5-7. However, under suitable conditions these required properties
will themselves follow from properties of the basic game $\mathcal{G}$. We
now detail this relationship for the particular case of Theorem 6.4.

We shall impose the following condition: 

Condition 9.1. There exists $K \in \mathbb{R}$ such that
$H(P_\omega) \ge K$ for all $\omega \in \Omega$.

By concavity of $H$, Condition 9.1 is equivalent to $H(Q) \ge K$ for all
$Q \in \Gamma$. The following lemma is proved in the Appendix.

Lemma 9.2. Suppose Condition 9.1 holds. Then if Conditions 6.1 and 6.3
hold for $L$ and $\Gamma$ (in $\mathcal{G}$), they likewise hold for
$\overline{L}$ and $\Lambda$ (in $\overline{\mathcal{G}}$).

The next theorem now follows directly from Lemmas 9.1 and 9.2 and 
Theorem 6.4. 



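Theorem 9.1. Suppose Conditions 6.1, 6.3 and 9.1 all hold for $L$ and
$\Gamma$ in $\mathcal{G}$ and, in addition, the statistical model $\Omega$ is
tight. Then $\overline{H}^* := \sup_{\Pi \in \Lambda} \overline{H}(\Pi)$ is
finite, the game $\overline{\mathcal{G}}^\Omega$ has value $\overline{H}^*$
and there exists a minimax (robust Bayes) act $\zeta^*$ in
$\overline{\mathcal{G}}^\Omega$ such that

(126) $\sup_{\omega \in \Omega} \overline{L}(\omega, \zeta^*) = \inf_{\zeta \in \mathcal{Z}} \sup_{\omega \in \Omega} \overline{L}(\omega, \zeta) = \sup_{\Pi \in \Lambda} \inf_{a \in \mathcal{A}} \overline{L}(\Pi, a) = \overline{H}^*$.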

We remark that the convexity requirement on $\Gamma$ in Condition 6.3 will
be satisfied automatically, while the finite entropy requirement is likewise
guaranteed by Condition 9.1 and the assumed finiteness of $H^*$.

The proof of Theorem A.2 shows that we can take $\zeta^*$ to be Bayes in
$\overline{\mathcal{G}}$ against some law $\Pi^*$ in the weak closure
$\overline{\Lambda}$ of $\Lambda$ (or, equivalently, Bayes in $\mathcal{G}$
against $P^* := P_{\Pi^*}$ in the weak closure $\overline{\Gamma}$ of
$\Gamma$). However, in general, if $\Lambda$ is not weakly closed, $\zeta^*$
need not be a Bayes act in $\overline{\mathcal{G}}$ against any prior
distribution $\Pi \in \Lambda$ (equivalently, not Bayes in $\mathcal{G}$
against any mixture distribution $P_\Pi \in \Gamma$).

On noting that for any reference act $\zeta_0$ the games
$\mathcal{G}^\Gamma$ and $\mathcal{G}_0^\Gamma$ induce the same derived game,
and using (111), we have the following.

Corollary 9.1. Suppose that there exists $\zeta_0 \in \mathcal{Z}$ such
that Conditions 6.1 and 6.3 hold for $L_0$ and $\Gamma$ in the relative game
$\mathcal{G}_0$, and, in addition, that $\Omega$ is tight. Suppose further
that $D(P_\omega, \zeta_0)$ is bounded above for $\omega \in \Omega$. Then
there exists a minimax (robust Bayes) act $\zeta^*$ in the game
$\overline{\mathcal{G}}^\Omega$.

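If the boundedness condition in Corollary 9.1 fails, we shall have

(127) $\sup_{\omega \in \Omega} \overline{L}(\omega, \zeta_0) = \sup_{\omega \in \Omega} D(P_\omega, \zeta_0) = \infty$.

It can thus fail for all $\zeta_0 \in \mathcal{Z}$ only when
$\inf_{\zeta \in \mathcal{Z}} \sup_{\omega \in \Omega} \overline{L}(\omega, \zeta) = \infty$;
that is, the upper value of the game $\overline{\mathcal{G}}^\Omega$ is
$\infty$. In this case the game has no value, and any
$\zeta \in \mathcal{Z}$ will trivially be minimax in
$\overline{\mathcal{G}}^\Omega$. In the contrary case, we would normally
expect to be able to find a suitable $\zeta_0 \in \mathcal{Z}$ to satisfy all
the conditions of Corollary 9.1 and thus demonstrate the existence of a
robust Bayes act $\zeta^*$ in $\overline{\mathcal{G}}^\Omega$.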

9.6. Kullback-Leibler loss: the redundancy-capacity theorem. An important
special case arises when the model $\Omega$ is dominated by a
$\sigma$-finite measure $\mu$, and the loss function $L$ in $\mathcal{G}$ is
given by the logarithmic score (20) with respect to $\mu$. In this case, for
any possible choice of $\mu$, the derived loss is just the Kullback-Leibler
divergence, $\overline{L}(\omega, P) = KL(P_\omega, P)$. We call such a game
a Kullback-Leibler game. The corresponding derived entropy
$\overline{H}(\Pi)$, as given by (124), becomes the mutual information,
$I_\Pi(X, \boldsymbol{\Omega})$, between $X$ and $\boldsymbol{\Omega}$, in
their joint distribution generated by the prior distribution $\Pi$ for
$\boldsymbol{\Omega}$ [Lindley (1956)]. There has been much research,
especially for asymptotic problems, into the existence and properties of a
maximin "reference" prior distribution $\Pi$ over $\Omega$ maximizing this
mutual information, or of a minimax act (which can be regarded as a
distribution $P^* \in \mathcal{M}$ over $\mathcal{X}$) for DM [Bernardo
(1979), Berger and Bernardo (1992), Clarke and Barron (1990, 1994), Haussler
(1997) and Xie and Barron (2000)].

The following result follows immediately from Corollary 9.1 and
Proposition A.1.

Theorem 9.2. Suppose that loss on $\Omega \times \mathcal{A}$ is measured
by $\overline{L}(\omega, P) = KL(P_\omega, P)$, and that the model $\Omega$
is tight. Then there exists a minimax act $P^* \in \mathcal{M}$ for
$\overline{\mathcal{G}}^\Omega$, achieving
$\inf_{P \in \mathcal{M}} \sup_{\omega \in \Omega} KL(P_\omega, P)$. When
this quantity is finite it is the value of the game and equals the maximum
attainable mutual information,
$I^* := \sup_{\Pi \in \Lambda} I_\Pi(X, \boldsymbol{\Omega})$.

Theorem 9.2, a version of the "redundancy-capacity theorem" of information
theory [Gallager (1976), Ryabko (1979), Davisson and Leon-Garcia (1980) and
Krob and Scholl (1997)], constitutes the principal result (Lemma 3) of
Haussler (1997). Our proof techniques are different, however.

If $I^*$ is achieved for some $\Pi^* \in \Lambda$, then $(\Pi^*, P^*)$ is a
saddle-point in $\overline{\mathcal{G}}^\Omega$, whence, since $P^*$ is then
Bayes in $\overline{\mathcal{G}}$ against $\Pi^*$, $P^*$ is the mixture
distribution $P_{\Pi^*} = \int P_\omega\, d\Pi^*(\omega)$. Furthermore, since
Lemma 4.2 applies in this case, we find that $\Pi^*$ must be supported on the
subspace $\Upsilon := \{\omega \in \Omega : KL(P_\omega, P^*) = I^*\}$. As
argued in Section 4.3, for the case of a continuous parameter-space $\Pi^*$
will typically be a discrete distribution. Notwithstanding this, it is known
that, for suitably regular problems, as sample size increases this discrete
maximin prior converges weakly to the absolutely continuous Jeffreys
invariant prior distribution [Bernardo (1979), Clarke and Barron (1994) and
Scholl (1998)].
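
For a finite model on a finite sample space, the maximin prior and the minimax act of Theorem 9.2 can be approximated numerically. The sketch below uses the Blahut-Arimoto iteration from information theory, which is our choice for illustration and not a method used in this paper; the function names, tolerances and example models are assumptions, and every distribution in the model is assumed to have full support. At each step the mutual information under the current prior is a lower bound on $I^*$ and the worst-case divergence from the current mixture is an upper bound.

```python
import math


def kl(p, q):
    """Kullback-Leibler divergence KL(P, Q) in nats; terms with p(x) = 0 vanish."""
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)


def minimax_redundancy(model, tol=1e-10, max_iter=10_000):
    """Blahut-Arimoto style iteration for a finite model on a finite space.

    Returns (prior, mixture, lower, upper) with lower <= I* <= upper.
    Assumes every distribution in `model` has full support.
    """
    m = len(model)
    prior = [1.0 / m] * m
    for _ in range(max_iter):
        mix = [sum(w * p[x] for w, p in zip(prior, model)) for x in range(len(model[0]))]
        div = [kl(p, mix) for p in model]
        lower = sum(w * d for w, d in zip(prior, div))  # mutual information under prior
        upper = max(div)                                # worst-case redundancy of mix
        if upper - lower < tol:
            break
        # Reweight the prior towards parameters with larger divergence.
        weights = [w * math.exp(d) for w, d in zip(prior, div)]
        total = sum(weights)
        prior = [w / total for w in weights]
    return prior, mix, lower, upper


# Two-point model (a binary symmetric channel with crossover probability 0.1).
bsc = [[0.9, 0.1], [0.1, 0.9]]
prior2, mix2, lower2, upper2 = minimax_redundancy(bsc)

# Adding the fair coin: it is already at zero divergence from the minimax act,
# so the maximin prior should give it vanishing weight.
prior3, mix3, lower3, upper3 = minimax_redundancy(bsc + [[0.5, 0.5]])

print(prior2, mix2, lower2, upper2)
print(prior3, mix3, lower3, upper3)
```

In the second example the prior mass drifts onto the two parameters whose divergence from the mixture equals the value of the game, in line with the support property noted above.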

10. The Pythagorean inequality. The Kullback-Leibler divergence satisfies
a property reminiscent of squared Euclidean distance. This property was
called the Pythagorean property by Csiszár (1975). The Pythagorean property
leads to an interpretation of minimum relative entropy inference as an
information projection operation. This view has been emphasized by Csiszár
and others in various papers [Csiszár (1975, 1991) and Lafferty (1999)]. Here
we investigate the Pythagorean property in our more general framework and 
show how it is intrinsically related to the minimax theorem: essentially, a 
Pythagorean inequality holds for a discrepancy function D if and only if the 
loss function L on which D is based admits a saddle-point in a suitable 
restricted game. Below we formally state and prove this; in Section 10.2 we 
shall give several examples. 

Let $\Gamma \subseteq \mathcal{P}$ be a family of distributions over
$\mathcal{X}$, and let $\zeta_0$ be a reference act, such that
$L(P, \zeta_0)$ is finite for all $P \in \Gamma$ [so that $L_0(P, \zeta)$ is
defined for all $P \in \Gamma$, $\zeta \in \mathcal{Z}$]. We impose no
further restrictions on $\Gamma$ (in particular, convexity is not required).
Consider the relative restricted game $\mathcal{G}_0^\Gamma$, with loss
function $L_0(P, a)$, for $P \in \Gamma$, $a \in \mathcal{A}$. We allow
randomization over $\mathcal{A}$ but not over $\Gamma$. The entropy function
for this game is $H_0(P) = -D(P, \zeta_0)$ and is always nonpositive.

Theorem 10.1. Suppose $(P^*, \zeta^*)$ is a saddle-point in
$\mathcal{G}_0^\Gamma$. Then for all $P \in \Gamma$,

(128) $D(P, \zeta^*) + D(P^*, \zeta_0) \le D(P, \zeta_0)$.

Conversely, if (128) holds with its right-hand side finite for all
$P \in \Gamma$, then $(P^*, \zeta^*)$ is a saddle-point in
$\mathcal{G}_0^\Gamma$.




Proof. Let $H_0^* := H_0(P^*) = -D(P^*, \zeta_0)$. If $(P^*, \zeta^*)$ is a
saddle-point in $\mathcal{G}_0^\Gamma$, then $H_0^* = L_0(P^*, \zeta^*)$ and
is finite. Also, for all $P \in \Gamma$,

(129) $L_0(P, \zeta^*) \le H_0^*$.

If $H_0(P) = -\infty$, then $D(P, \zeta_0) = \infty$, so that (128) holds
trivially. Otherwise, (129) is equivalent to

(130) $\{L_0(P, \zeta^*) - H_0(P)\} + \{-H_0^*\} \le \{-H_0(P)\}$,

which is just (128).

Conversely, in the case that $D(P, \zeta_0)$ is finite for all
$P \in \Gamma$, (128) implies (129). Also, putting $P = P^*$ in (128) gives
$D(P^*, \zeta^*) = 0$, which is equivalent to $\zeta^*$ being Bayes against
$P^*$. Moreover, $H_0(P^*) = -D(P^*, \zeta_0)$ is finite. By (44),
$(P^*, \zeta^*)$ is a saddle-point in $\mathcal{G}_0^\Gamma$. □

Corollary 10.1. If $S$ is a $\mathcal{Q}$-proper scoring rule and
$\Gamma \subseteq \mathcal{Q}$, then in the restricted relative game
$\mathcal{G}_0^\Gamma$ having loss $S_0(P, Q)$ (for fixed reference
distribution $P_0 \in \mathcal{Q}$), if $(P^*, P^*)$ is a saddle-point (in
which case $P^*$ is both maximum entropy and robust Bayes), then for all
$P \in \Gamma$,

(131) $d(P, P^*) + d(P^*, P_0) \le d(P, P_0)$.

Conversely, if (131) holds and $d(P, P_0) < \infty$ for all $P \in \Gamma$,
then $(P^*, P^*)$ is a saddle-point in $\mathcal{G}_0^\Gamma$.

We shall term (128), or its special case (131), the Pythagorean inequality.
We deduce from (128), together with $D(P, \zeta_0) = -H_0(P)$, that for all
$P \in \Gamma$,

(132) $H_0(P^*) - H_0(P) \ge D(P, \zeta^*)$,

providing a quantitative strengthening of the maximum relative entropy
property, $H_0(P^*) - H_0(P) \ge 0$, of $P^*$. Similarly, (131) yields

(133) $H_0(P^*) - H_0(P) \ge d(P, P^*)$.

Often we are interested not in the relative game $\mathcal{G}_0^\Gamma$ but
in the original game $\mathcal{G}^\Gamma$. The following corollary relates
the Pythagorean inequality to this original game:

Corollary 10.2. Suppose that in the restricted game $\mathcal{G}^\Gamma$
there exists an act $\zeta_0 \in \mathcal{Z}$ such that
$L(P, \zeta_0) = k \in \mathbb{R}$, for all $P \in \Gamma$ (in particular,
this will hold if $\zeta_0$ is neutral). Then, if $(P^*, \zeta^*)$ is a
saddle-point in $\mathcal{G}^\Gamma$, (128) holds for all $P \in \Gamma$; the
converse holds if $H(P)$ is finite for all $P \in \Gamma$.






10.1. Pythagorean equality. Related work to date has largely confined
itself to the case of equality in (128). This has long been known to hold
for the Kullback-Leibler divergence of Section 8.5.2 [Csiszár (1975)]. More
recently [Jones and Byrne (1990), Csiszár (1991) and Della Pietra, Della
Pietra and Lafferty (2002)], it has been shown to hold for a general Bregman
divergence under certain additional conditions. This result extends beyond
our framework in that it allows for divergences not defined on probability
spaces. On the other hand, when we try to apply it to probability spaces as
in Section 3.5.4, its conditions are seen to be highly restrictive,
requiring not only differentiability but also, for example, that the tangent
space $\nabla H(q)$ of $H$ at $q$ should become infinitely steep as $q$
approaches the boundary of the probability simplex. This is not satisfied
even for such simple cases as the Brier score: see Section 10.2.1, where we
obtain strict inequality in (128).

The following result follows easily on noting that we have equality in (128) 
if and only if we have it in (129): 

Theorem 10.2. Suppose $(P^*, \zeta^*)$ is a saddle-point in
$\mathcal{G}_0^\Gamma$. If $\zeta^*$ is an equalizer rule in
$\mathcal{G}_0^\Gamma$ [i.e., $L_0(P, \zeta^*) = H_0(P^*)$ for all
$P \in \Gamma$], then (128) holds with equality for all $P \in \Gamma$.
Conversely, if (128) holds with equality, then
$L_0(P, \zeta^*) = H_0(P^*)$ for all $P \in \Gamma$ such that
$D(P, \zeta_0) < \infty$; in particular, if $D(P, \zeta_0) < \infty$ for all
$P \in \Gamma$, $\zeta^*$ is an equalizer rule in $\mathcal{G}_0^\Gamma$.

Combining Theorem 10.2 with Theorem 7.1(i) or Corollary 7.2 now gives 
the following: 

Corollary 10.3. Let
$\Gamma = \Gamma_\tau = \{P \in \mathcal{P} : E_P\{t(X)\} = \tau\}$. Suppose
$(P^*, \zeta^*) := (P_\tau, \zeta_\tau)$ is a saddle-point in
$\mathcal{G}_0^\Gamma$. If either $(P_\tau, \zeta_\tau)$ is a linear pair or
$P \ll P_\tau$, then (128) holds with equality.

10.2. Examples. We now illustrate the Pythagorean theorem and its con- 
sequences for our running examples. 

10.2.1. Brier score. Let $\mathcal{X}$ be finite. As remarked in
Section 8.5.1, the Brier divergence $d(P, Q)$ between two distributions $P$
and $Q$ is just $\|p - q\|^2$. Let $\Gamma \subseteq \mathcal{P}$ be closed
and convex. By Theorem 5.2, we know that there then exists a
$P^* \in \Gamma$ such that $(P^*, P^*)$ is a saddle-point in the relative
game $\mathcal{G}_0^\Gamma$. Therefore, by Corollary 10.1 we have, for all
$P \in \Gamma$,

(134) $\|p - p^*\|^2 + \|p^* - p_0\|^2 \le \|p - p_0\|^2$

or equivalently, on expanding $\|p - p_0\|^2 = \|(p - p^*) + (p^* - p_0)\|^2$,

(135) $(p - p^*)^{\mathsf T}(p^* - p_0) \ge 0$.

The distribution $P^*$ within $\Gamma$ that maximizes the Brier entropy
relative to $P_0$, or equivalently that minimizes the Brier discrepancy to
$P_0$, is given by the point closest to $P_0$ in $\Gamma$, that is, the
Euclidean projection of $P_0$ onto $\Gamma$. That this distribution is also a
saddle-point is reflected in the fact that the angle
$\angle(p, p^*, p_0) \ge 90^\circ$ for all $P \in \Gamma$.

Consider again the case $\mathcal{X} = \{-1, 0, 1\}$ and constraint
$E_P(X) = \tau$. For $\tau \in [-2/3, 2/3]$, where (except for the extreme
cases) the minimizing point $p_\tau$ is in the interior of the line segment,
(135), and so (134), holds with equality for all $P \in \Gamma_\tau$; while
for $\tau$ outside this interval, where the minimizing point is at one end
of the segment, (135) and (134) hold with strict inequality for all
$P \in \Gamma_\tau \setminus \{P_\tau\}$. Note further that in the former
case $p_\tau$ is linear; for $\tau \in (-2/3, 2/3)$ $p_\tau$ is in the
interior of the simplex, so that $P_\tau$ has full support. Hence, by
Theorem 7.1(i) or Corollary 7.2, $p_\tau$ is an equalizer rule. In the latter
case $P_\tau$ does not have full support, and indeed the strict inequality
in (134) implies by Theorem 10.2 that it cannot be an equalizer rule.
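
The two regimes can be checked numerically. The following sketch is an illustration under assumed conventions: vectors are ordered as $(p(-1), p(0), p(1))$, the helper names are hypothetical, and $\tau = \pm 1$ is excluded because the segment then degenerates to a point. It projects the uniform reference point onto the segment $\Gamma_\tau$ and evaluates the slack in (134) at several points of the segment.

```python
def gamma_endpoints(tau):
    """End points of Gamma_tau = {p : p(1) - p(-1) = tau}, for -1 < tau < 1."""
    lo, hi = max(0.0, -tau), (1.0 - tau) / 2.0  # admissible range of p(-1)
    return (lo, 1.0 - tau - 2.0 * lo, tau + lo), (hi, 1.0 - tau - 2.0 * hi, tau + hi)


def project_onto_segment(p0, a, b):
    """Euclidean projection of p0 onto the segment joining a and b (a != b)."""
    ab = [bj - aj for aj, bj in zip(a, b)]
    t = sum((pj - aj) * dj for pj, aj, dj in zip(p0, a, ab)) / sum(dj * dj for dj in ab)
    t = min(1.0, max(0.0, t))  # clamp so the projection stays on the segment
    return tuple(aj + t * dj for aj, dj in zip(a, ab))


def sq_dist(u, v):
    return sum((uj - vj) ** 2 for uj, vj in zip(u, v))


def pythagorean_slack(p, p_star, p0):
    """Right-hand side minus left-hand side of (134)."""
    return sq_dist(p, p0) - sq_dist(p, p_star) - sq_dist(p_star, p0)


def slacks(tau, p0, fractions=(0.0, 0.5, 1.0)):
    """Slack in (134) at points spread along Gamma_tau."""
    a, b = gamma_endpoints(tau)
    p_star = project_onto_segment(p0, a, b)
    points = [tuple(aj + u * (bj - aj) for aj, bj in zip(a, b)) for u in fractions]
    return [pythagorean_slack(p, p_star, p0) for p in points]


uniform = (1.0 / 3.0, 1.0 / 3.0, 1.0 / 3.0)
slack_inside = slacks(0.25, uniform)   # tau in [-2/3, 2/3]
slack_outside = slacks(0.75, uniform)  # tau outside that interval

print(slack_inside)
print(slack_outside)
```

For $\tau = 0.25$ the slack vanishes up to rounding at every point, while for $\tau = 0.75$ it is strictly positive away from the projection itself.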

We can also use (135) to investigate the existence of a saddle-point for
certain nonconvex $\Gamma$. Thus suppose, for example, that $\Gamma$ is
represented in the simplex by a spherical surface. Then the necessary and
sufficient condition (135) for a saddle-point will hold for a reference point
$p_0$ outside the sphere, but fail for $p_0$ inside. In the latter case
Corollary 4.1 does not apply, and the maximum Brier entropy distribution in
$\Gamma$ (the point in $\Gamma$ closest to the center of the simplex) will
not be robust Bayes against $\Gamma$.

10.2.2. Logarithmic score. In this case $d(P, Q)$ becomes the
Kullback-Leibler divergence $KL(P, Q)$ ($P, Q \in \mathcal{M}$). This has
been intensively studied for the case of mean-value constraints
$\Gamma_\tau^{\mathcal{M}} = \{P \in \mathcal{M} : E_P(T) = \tau\}$
($\tau \in \mathcal{T}^0$), when the Pythagorean property (131) holds with
equality [Csiszár (1975)]. By Theorem 10.2 this is essentially equivalent to
the equalizer property of the maximum relative entropy density $p_\tau$, as
already demonstrated (in a way that even extends to distributions
$P \in \Gamma_\tau \setminus \mathcal{M}$) in Section 7.3. (Recall from
Section 8.5.2 that in this case the relative entropy, with respect to a
reference distribution $P_0$, is simply the ordinary entropy under base
measure $P_0$.)

In the simple discrete example studied in Section 7.6.2, the above
equalizer property also extended (trivially) to the boundary points
$\tau = \pm 1$. Such an extension also holds for more general discrete sample
spaces, since the condition of Corollary 7.2 can be shown to apply when
$\tau$ is on the boundary of $\mathcal{T}$. So in all such cases the
Pythagorean inequality (131) is in fact an equality.

10.2.3. Zero-one loss. For the case $\mathcal{X} = \{-1, 0, 1\}$ and
constraint $E_P(X) = \tau$, with $\zeta_0$ uniform on $\mathcal{X}$, we have
$H_0(P) = H(P) - 1 + 1/N$, and then (132) (equivalent to both the
Pythagorean and the saddle-point property) asserts: for all
$P \in \Gamma_\tau$,

(137) $\sum_x p(x)\, \zeta_\tau(x) \ge p_{\tau,\max}$.

This can be confirmed for the specifications of $P_\tau$ and $\zeta_\tau$
given in Tables 2 and 3. Specifically, for $0 \le \tau < \frac{1}{2}$, both
sides of (137) are $(1 + \tau)/3$ (the equality confirming that in this case
we have an equalizer rule), while, for $\frac{1}{2} < \tau \le 1$, (137)
becomes $\tau \le p(1)$, which holds since $\tau = p(1) - p(-1)$ (in
particular we have strict inequality, and hence do not have an equalizer
rule, unless $\tau = 1$). For $\tau = \frac{1}{2}$, we calculate
$\sum_x p(x)\, \zeta_\tau(x) - p_{\tau,\max} = (1 - 3a)\, p(-1)$, which is
nonnegative since $a \le 1/3$, so verifying the Pythagorean inequality, and
hence the robust Bayes property of $\zeta_{1/2} = (0, a, 1 - a)$ for
$a \le \frac{1}{3}$, although this will be an equalizer rule only for
$a = \frac{1}{3}$. Similar results hold when $-1 \le \tau < 0$.

11. Conclusions and further work. 

11.1. What has been achieved. In this paper we started by interpreting 
the Shannon entropy of a distribution P as the smallest expected logarith- 
mic loss a DM can achieve when the data are distributed according to P. 
We showed how this interpretation (a) allows for a reformulation of the 
maximum entropy procedure as a robust Bayes procedure and (b) can be 
generalized to supply a natural extension of the concept of entropy to ar- 
bitrary loss functions. Both these ideas were already known. Our principal 
novel contribution lies in the combination of the two: the generalized en- 
tropies typically still possess a minimax property, and therefore maximum 
generalized entropy can again be justified as a robust Bayes procedure. For 
some simple decision problems, as in Section 5, this result is based on an 
existing minimax theorem due to Ferguson (1967); see the Appendix, Sec- 
tion A.l. For others, as in Section 6, we need more general results, such as 
Lemma A.l, which uses a (so far as we know) novel proof technique. 

We have also considered in detail in Section 7 the special minimax re- 
sults available when the constraints have the form of known expectations 
for certain quantities. Arising out of this is our second novel contribution: 
an extension of the usual definition of "exponential family" to a more gen- 
eral decision framework, as described in Section 7.4. We believe that this 
extension holds out the promise of important new general statistical theory, 
such as variations on the concept of sufficiency. 

Our third major contribution lies in relating the above theory to the prob- 
lem of minimizing a discrepancy between distributions. This in turn leads 



(136) H(P T ) - H(. 

Using (25) and (120), (136) becomes 
(137) 
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to two further results: in Section 9.5 we generalize Haussler's minimax theo- 
rem for the Kullback-Leibler divergence to apply to arbitrary discrepancies; 
in Section 10 we demonstrate the equivalence between the existence of a 
saddle-point and a "Pythagorean inequality." 

11.2. Possible developments. We end by discussing some possible exten- 
sions of our work. 

11.2.1. Moment inequalities. As an extension to the moment equalities
discussed in Section 7, one may consider robust Bayes problems for moment
inequalities, of the form $\Gamma = \{P : E_P(T) \in A\}$, where $A$ is a
general (closed, convex) subset of $\mathbb{R}^k$. A direct approach to (39)
is complicated by the combination of inner maximization and outer
minimization [Noubiap and Seidel (2001)]. Replacement of this problem by a
single maximization of entropy over $\Gamma$ could well simplify analysis.

11.2.2. Nonparametric robust Bayes. Much of robust Bayes analysis involves
"nonparametric" families $\Gamma$: for example, we might have a reference
distribution $P_0$, but, not being sure of its accurate specification, wish to
guard against any $P$ in the "$\varepsilon$-neighborhood" of $P_0$, that is,
$\{P_0 + c(P - P_0) : |c| \le \varepsilon,\ P \text{ arbitrary}\}$. Such a set
being closed and convex, a saddle-point will typically exist, and then we can
again, in principle, find the robust Bayes act by maximizing the generalized
entropy. However, in general it may not be easy to determine or describe the
solution to this problem.

11.2.3. Other generalizations of entropy and entropy optimization problems.
It would be interesting to make connections between the generalized
entropies and discrepancies defined in this text and the several other
generalizations of entropy and relative entropy which exist in the
literature. Two examples are the Rényi entropies [Rényi (1961)] and the family
of entropies based on expected Fisher information considered by Borwein,
Lewis and Noll (1996).

Finally, very recently, Harremoës and Topsøe [Topsøe (2002) and Harremoës and
Topsøe (2002)] have proposed a generalization of Topsøe's original minimax
characterization of entropy [Topsøe (1979)]. They show that a whole range of
entropy-related optimization problems can be interpreted from a minimax
perspective. While Harremoës and Topsøe's results are clearly related to
ours, the exact relation remains a topic of further investigation.

APPENDIX: PROOFS OF MINIMAX THEOREMS 

We first prove Theorem 6.1, which can be used for loss functions that
are bounded from above, and Theorem 6.2, which relates saddle-points to
differentiability of the entropy. We then prove a general lemma, Lemma A.1,
which can be used for unbounded loss functions but imposes other
restrictions. This lemma is used to prove Theorem 6.3. Next we demonstrate a
general result, Theorem A.2, which implies Theorem 6.4. Finally we prove
Lemma 9.2.

A.1. Theorem 6.1: $L$ upper-bounded, $\Gamma$ closed and tight. The following
result follows directly from Theorem 2 of Ferguson [(1967), page 85].

Theorem A.1. Consider a game $(\mathcal{X}, \mathcal{A}, L)$. Suppose that
$L$ is bounded below and that there is a topology on $\mathcal{Z}$, the space
of randomized acts, such that the following hold:

(i) $\mathcal{Z}$ is compact.

(ii) $L : \mathcal{X} \times \mathcal{Z} \to \mathbb{R}$ is lower
semicontinuous in $\zeta$ for all $x \in \mathcal{X}$.

Then the game has a value, that is,
$\sup_{P \in \mathcal{P}} \inf_{a \in \mathcal{A}} L(P, a) = \inf_{\zeta \in \mathcal{Z}} \sup_{x \in \mathcal{X}} L(x, \zeta)$.
Moreover, a minimax $\zeta$, attaining
$\inf_{\zeta \in \mathcal{Z}} \sup_{x \in \mathcal{X}} L(x, \zeta)$, exists.

Note that $\mathcal{Z}$ could be any convex set. By symmetry considerations,
we thus have the following.

Corollary A.1. Consider a game $(\Gamma, \mathcal{A}, L)$. Suppose that $L$
is bounded above and there is a topology on $\Gamma$ such that the following
hold:

(i) $\Gamma$ is convex and compact.

(ii) $L : \Gamma \times \mathcal{A} \to \mathbb{R}$ is upper semicontinuous
in $P$ for all $a \in \mathcal{A}$.

Then the game has a value, that is,
$\inf_{\zeta \in \mathcal{Z}} \sup_{P \in \Gamma} L(P, \zeta) = \sup_{P \in \Gamma} \inf_{a \in \mathcal{A}} L(P, a)$.
Moreover, a maximin $P$, attaining
$\sup_{P \in \Gamma} \inf_{a \in \mathcal{A}} L(P, a)$, exists.

Proof of Theorem 6.1. Since $\Gamma$ is tight and weakly closed, by
Prohorov's theorem [Billingsley (1999), Theorem 5.1] it is weakly compact.
Also, under the conditions imposed $L(P, a)$ is, for each
$a \in \mathcal{A}$, upper semicontinuous in $P$ in the weak topology
[Stroock (1993), Theorem 3.1.5(v)]. Theorem 6.1 now follows from
Corollary A.1. □

A.2. Theorems 6.2 and 6.3: $L$ unbounded, $\sup H(P)$ achieved. Throughout
this section, we assume that $\Gamma$ is convex and that
$H^* := \sup_{P \in \Gamma} H(P)$ is finite and is achieved for some
$P^* \in \Gamma$ admitting a not necessarily unique Bayes act $\zeta^*$.

To prove that $(P^*, \zeta^*)$ is a saddle-point, it is sufficient to show
that $L(P, \zeta^*) \le L(P^*, \zeta^*) = H^*$ for all $P \in \Gamma$.



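Proof of Theorem 6.2. By Lemma 3.2, $L(P, \zeta^*)$ and $L(P_0, \zeta^*)$ are
finite, and $f(\lambda) := L(Q_\lambda, \zeta^*)$ is linear in
$\lambda \in [0, 1]$. Also, $f(\lambda) \ge H(Q_\lambda)$ for all $\lambda$
and $f(\lambda^*) = H(Q_{\lambda^*}) = H^*$. Thus $f(\lambda)$ must coincide
with the tangent to the curve $H(Q_\lambda)$ at $\lambda = \lambda^*$. It
follows that

(138)  $L(P, \zeta^*) = f(1) = H^* + (1 - \lambda^*) \left\{ \frac{d}{d\lambda} H(Q_\lambda) \right\}_{\lambda = \lambda^*}$.

However,

$\left\{ \frac{d}{d\lambda} H(Q_\lambda) \right\}_{\lambda = \lambda^*} \le 0$,

since $H(Q_\lambda) \le H^*$ for $\lambda > \lambda^*$. We deduce
$L(P, \zeta^*) \le H^*$. □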

Note. If $P_0$ in the statement of Theorem 6.2 can be chosen to be in
$\Gamma$, then we further have $H(Q_\lambda) \le H^*$ for
$\lambda < \lambda^*$, which implies
$\{(d/d\lambda) H(Q_\lambda)\}_{\lambda = \lambda^*} = 0$, and hence
$L(P, \zeta^*) = H^*$. In particular, if this can be done for all
$P \in \Gamma$ (i.e., $P^*$ is an "algebraically interior" point of
$\Gamma$), then $\zeta^*$ will be an equalizer rule.

From this point on, for any $P \in \Gamma$, $\lambda \in [0, 1]$ we write
$P_\lambda := \lambda P + (1 - \lambda) P^*$. Then, since we are assuming
$\Gamma$ convex, $P_\lambda \in \Gamma$.

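Lemma A.1. Suppose Conditions 6.3 and 6.4 hold. Let $\zeta_\lambda$ be Bayes
against $P_\lambda$ (in particular, $\zeta^* := \zeta_0$ is Bayes against
$P^*$, and $\zeta_1$ is Bayes against $P$). Then

(139)  $L(P, \zeta_\lambda) - L(P^*, \zeta_\lambda) = \dfrac{H(P_\lambda) - L(P^*, \zeta_\lambda)}{\lambda}$

(140)  $\le 0$

$(0 < \lambda < 1)$. Moreover, $\lim_{\lambda \downarrow 0} L(P^*, \zeta_\lambda)$
and $\lim_{\lambda \downarrow 0} L(P, \zeta_\lambda)$ both exist as finite
numbers, and

(141)  $\lim_{\lambda \downarrow 0} L(P^*, \zeta_\lambda) = H^*$.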

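Proof. First note that, since $H(P_\lambda) = L(P_\lambda, \zeta_\lambda)$ is
finite, by Lemma 3.2 both $L(P, \zeta_\lambda)$ and $L(P^*, \zeta_\lambda)$
are finite for $0 < \lambda < 1$. Also by Lemma 3.2, for all
$\zeta \in \mathcal{Z}$, $L(P_\lambda, \zeta)$ is, when finite, a linear
function of $\lambda \in [0, 1]$. Then

$\lambda L(P, \zeta) + (1 - \lambda) L(P^*, \zeta) = L(P_\lambda, \zeta)$

(142)  $\ge H(P_\lambda) = L(P_\lambda, \zeta_\lambda)$

(143)  $= \lambda L(P, \zeta_\lambda) + (1 - \lambda) L(P^*, \zeta_\lambda)$.

On putting $\zeta = \zeta_\lambda$ we have equality in (142); then rearranging
yields (139), and (140) follows from $L(P^*, \zeta_\lambda) \ge H^*$ and
$H(P_\lambda) \le H^*$.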






For general $\zeta \in \mathcal{Z}$ we obtain (when all terms are finite)

(144)  $\lambda \{L(P, \zeta_\lambda) - L(P, \zeta)\} \le (1 - \lambda) \{L(P^*, \zeta) - L(P^*, \zeta_\lambda)\}$.

Put $\zeta = \zeta_1$, so that $L(P, \zeta_1) = H(P)$ is finite, and first
suppose that $L(P^*, \zeta_1)$ is finite. Then the left-hand side of (144) is
nonnegative, and so $L(P^*, \zeta_1) \ge L(P^*, \zeta_\lambda)$
$(0 < \lambda < 1)$, which inequality clearly also holds if
$L(P^*, \zeta_1) = \infty$. An identical argument can be applied on first
replacing $\zeta_1$ by $\zeta_{\lambda'}$ $(0 < \lambda' < 1)$, and we deduce
that $L(P^*, \zeta_{\lambda'}) \ge L(P^*, \zeta_\lambda)$
$(0 < \lambda < \lambda' < 1)$. That is to say, $L(P^*, \zeta_\lambda)$ is a
nondecreasing function of $\lambda$ on $[0, 1]$. It follows that

(145)  $\lim_{\lambda \downarrow 0} L(P^*, \zeta_\lambda) \ge L(P^*, \zeta_0) = H^*$.

A parallel argument, interchanging the roles of $P^*$ and $P$, shows that
$L(P, \zeta_\lambda)$ is nonincreasing in $\lambda \in [0, 1]$. Since, by
(140), for all $\lambda \in (0, 0.5]$,
$L(P, \zeta_\lambda) \le L(P^*, \zeta_\lambda) \le L(P^*, \zeta_{0.5}) < \infty$,
it follows that $\lim_{\lambda \downarrow 0} L(P, \zeta_\lambda)$ exists and
is finite.

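Since $P^*$ maximizes entropy over $\Gamma$,

$H(P^*) - L(P^*, \zeta_\lambda) \ge H(P_\lambda) - L(P^*, \zeta_\lambda)$

(146)  $= \lambda \{L(P, \zeta_\lambda) - L(P^*, \zeta_\lambda)\}$,

by (143). On noting $L(P^*, \zeta_\lambda) \le L(P^*, \zeta_1)$ since
$L(P^*, \zeta_\lambda)$ is nondecreasing, and using
$L(P, \zeta_\lambda) \ge H(P)$, (146) implies
$H^* - L(P^*, \zeta_\lambda) \ge \lambda \{H(P) - L(P^*, \zeta_1)\}$. If
$L(P^*, \zeta_1) < \infty$, then letting $\lambda \downarrow 0$ we obtain
$H^* \ge \lim_{\lambda \downarrow 0} L(P^*, \zeta_\lambda)$, which, together
with (145), establishes (141). Otherwise, noting that
$L(P^*, \zeta_{0.5}) < \infty$, we can repeat the argument with $P$ replaced
by $P_{0.5}$. □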
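The statements of Lemma A.1 can be checked numerically in a simple special case. The sketch below assumes the logarithmic score on a three-point sample space, takes $\Gamma$ to be the whole simplex (so that $P^*$ is uniform), and picks an arbitrary $P$; all of these are illustrative assumptions, as are the function names. Since the logarithmic score is proper, the Bayes act against $P_\lambda$ is $P_\lambda$ itself.

```python
import math

P_STAR = (1 / 3, 1 / 3, 1 / 3)   # maximizes Shannon entropy on three points
P = (0.7, 0.2, 0.1)              # an arbitrary competitor (assumed)


def loss(q, zeta):
    """Expected logarithmic loss L(q, zeta) = -sum_x q(x) log zeta(x)."""
    return -sum(qi * math.log(zi) for qi, zi in zip(q, zeta) if qi > 0)


def entropy(q):
    """Shannon entropy H(q) = L(q, q), in nats."""
    return loss(q, q)


def mix(lam):
    """P_lambda = lambda P + (1 - lambda) P*, which is its own Bayes act."""
    return tuple(lam * p + (1 - lam) * s for p, s in zip(P, P_STAR))


def lemma_a1_terms(lam):
    """Return the left- and right-hand sides of (139) for 0 < lam < 1."""
    zeta = mix(lam)
    lhs = loss(P, zeta) - loss(P_STAR, zeta)
    rhs = (entropy(zeta) - loss(P_STAR, zeta)) / lam
    return lhs, rhs


H_STAR = entropy(P_STAR)
for lam in (0.5, 0.1, 0.01, 0.001):
    lhs, rhs = lemma_a1_terms(lam)
    # Last column: L(P*, zeta_lambda) - H*, which shrinks to 0 as in (141).
    print(lam, lhs, rhs, loss(P_STAR, mix(lam)) - H_STAR)
```

In this example the two sides of (139) agree and are nonpositive, in line with (140), and $L(P^*, \zeta_\lambda)$ decreases to $H^* = \log 3$ as $\lambda \downarrow 0$.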

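Corollary A.2.

(147)  $\lim_{\lambda \downarrow 0} L(P, \zeta_\lambda) - H^* = \lim_{\lambda \downarrow 0} \dfrac{H(P_\lambda) - L(P^*, \zeta_\lambda)}{\lambda}$.

Corollary A.3 (Condition for existence of a saddle-point).
$L(P, \zeta^*) \le H(P^*)$ if and only if

(148)  $\lim_{\lambda \downarrow 0} \dfrac{H(P_\lambda) - L(P^*, \zeta_\lambda)}{\lambda} \le \lim_{\lambda \downarrow 0} L(P, \zeta_\lambda) - L(P, \zeta^*)$.

Proof of Theorem 6.3. The conditions of Lemma A.1 are satisfied.
By Corollary A.3 and (140), we see that it is sufficient to prove that, for
all $P \in \Gamma$,

(149)  $0 \le \lim_{\lambda \downarrow 0} L(P, \zeta_\lambda) - L(P, \zeta^*)$.

However, (149) is implied by Condition 6.1. □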
A.3. If $\sup_{P \in \Gamma} H(P)$ is not achieved. In some cases
$\sup_{P \in \Gamma} H(P)$ may not be achieved in $\Gamma$ [Topsøe (1979)].
We might then think of enlarging $\Gamma$ to, say, its weak closure
$\overline{\Gamma}$. However, this can be much bigger than $\Gamma$. For
example, for uncountable $\mathcal{X}$, the weak closure of a set, all of
whose members are absolutely continuous with respect to $\mu$, typically
contains distributions that are not. Then Theorem 6.3 may not be applicable.

Example A.1. Consider the logarithmic score, as in Section 3.5.2, with
$\mathcal{X} = \mathbb{R}$ and $\mu$ Lebesgue measure, and let
$\Gamma = \{P : P \ll \mu,\ E(X) = 0,\ E(X^2) = 1\}$. Then
$\overline{\Gamma}$ contains the distribution $P$ with
$P(X = 1) = P(X = -1) = 1/2$, for which $H(P) = -\infty$. There is no Bayes
act against this $P$.
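One way to see how the two-point distribution arises as a weak limit of members of $\Gamma$ is to consider the Gaussian mixtures $\tfrac12 N(-m, s^2) + \tfrac12 N(m, s^2)$ with $m^2 + s^2 = 1$; this particular approximating sequence is an assumption made for illustration, not a construction from the text. Each mixture has mean 0 and second moment 1, and its differential entropy is at most the entropy of the mixing weights plus the entropy of a single component, a standard bound. The sketch below evaluates that bound as $s \downarrow 0$; the function name is hypothetical.

```python
import math


def mixture_entropy_upper_bound(s):
    """Upper bound (in nats) on the differential entropy of the mixture
    0.5 N(-m, s^2) + 0.5 N(m, s^2) with m^2 + s^2 = 1.

    Uses h(mixture) <= H(weights) + average component entropy, where each
    component N(., s^2) has entropy 0.5 * log(2 * pi * e * s^2).
    """
    if not 0 < s < 1:
        raise ValueError("s must lie strictly between 0 and 1")
    component = 0.5 * math.log(2 * math.pi * math.e * s * s)
    return math.log(2) + component


for s in (0.5, 0.1, 0.01, 0.001):
    m = math.sqrt(1 - s * s)   # keeps E(X^2) = m^2 + s^2 = 1
    print(s, m, mixture_entropy_upper_bound(s))
```

As $s$ decreases, $m$ tends to 1 and the bound tends to $-\infty$, so the entropies along this sequence are unbounded below.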

This example illustrates that, in case $\sup_{P \in \Gamma} H(P)$ is not
achieved [for an instance of this, see Cover and Thomas (1991), Chapter 9],
we cannot simply take the closure of $\Gamma$ and then apply Theorem 6.3,
since Condition 6.3 could still be violated.

The following theorem, which in turn implies Theorem 6.4 of Section 6,
shows that the game $(\Gamma, \mathcal{A}, L)$ will often have a value even
when $\Gamma$ is not weakly closed. We need to impose an additional
condition:

Condition A.1. Every sequence $(Q_n)$ of distributions in $\Gamma$ such that
$H(Q_n)$ converges to $H^*$ has a weak limit point in $\mathcal{P}_0$.

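Theorem A.2. Suppose Conditions 6.1, 6.3 and A.1 hold. Then there
exists $\zeta^* \in \mathcal{Z}$ such that

(150)  $\sup_{P \in \Gamma} L(P, \zeta^*) = \inf_{\zeta \in \mathcal{Z}} \sup_{P \in \Gamma} L(P, \zeta) = \sup_{P \in \Gamma} \inf_{a \in \mathcal{A}} L(P, a) = H^*$.

In particular, the game $\mathcal{G}^\Gamma$ has value $H^*$, and $\zeta^*$
is robust Bayes against $\Gamma$.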

Proof. Let $(Q_n)$ be a sequence in $\Gamma$ such that $H(Q_n)$ converges to
$H^*$. In particular, $(H(Q_n))$ is bounded below. On choosing a subsequence
if necessary, we can suppose by Condition A.1 that $(Q_n)$ has a weak limit
$P^*$, and further that for all $n$, $H^* - H(Q_n) < 1/n$. By Condition 6.1,
$P^*$ has a Bayes act $\zeta^*$.

Now pick any $P \in \Gamma$. We will show that $L(P, \zeta^*) \le H^*$. First
fix $n$ and define $R^n_\lambda := \lambda P + (1 - \lambda) Q_n$,
$H^n_\lambda := H(R^n_\lambda)$ $(0 \le \lambda \le 1)$. In particular,
$R^n_0 = Q_n$, $R^n_1 = P$. Then $R^n_\lambda \in \Gamma$, with Bayes act
$\zeta^n_\lambda$, say. We have
$H^n_\lambda = L(R^n_\lambda, \zeta^n_\lambda) = \lambda L(P, \zeta^n_\lambda) + (1 - \lambda) L(R^n_0, \zeta^n_\lambda)$,
while $H^n_0 \le L(R^n_0, \zeta^n_\lambda)$. It follows that

(151)  $L(P, \zeta^n_\lambda) \le H^n_0 + (H^n_\lambda - H^n_0)/\lambda$.

Since $H^n_0 = H(Q_n) \ge H^* - 1/n$ and $H^n_0, H^n_\lambda \le H^*$, we
obtain

(152)  $L(P, \zeta^n_{1/\sqrt{n}}) \le H^* + 1/n + 1/\sqrt{n}$.

Now with $Q'_n := R^n_{1/\sqrt{n}}$, $(Q'_n)$ converges weakly to $P^*$.
Moreover, $H(Q'_n) \ge (1/\sqrt{n}) H(P) + (1 - 1/\sqrt{n}) H(Q_n)$ is bounded
below. On applying Condition 6.1 to $Q'_n$, and using (152), we deduce

(153)  $L(P, \zeta^*) \le H^*$.

It now follows that

(154)  $\inf_{\zeta \in \mathcal{Z}} \sup_{P \in \Gamma} L(P, \zeta) \le \sup_{P \in \Gamma} L(P, \zeta^*) \le H^*$.

However,

(155)  $H^* = \sup_{P \in \Gamma} \inf_{a \in \mathcal{A}} L(P, a) = \sup_{P \in \Gamma} \inf_{\zeta \in \mathcal{Z}} L(P, \zeta) \le \inf_{\zeta \in \mathcal{Z}} \sup_{P \in \Gamma} L(P, \zeta)$,

where the second equality follows from Proposition 3.1 and the third
inequality is standard. Together, (154) and (155) imply the theorem. □

Proof of Theorem 6.4. If $\Gamma$ is tight, then by Prohorov's theorem
any sequence $(Q_n)$ in $\Gamma$ must have a weak limit point, so that, in
particular, Condition A.1 holds. □

It should be noted that, for $P^*$ appearing in the above proof, we may
have $H(P^*) \ne H^*$. In the case of Shannon entropy, we have
$H(P^*) \le H^*$; a detailed study of the case of strict inequality has been
carried out by Harremoës and Topsøe (2001).

We now show, following Csiszár (1975) and Topsøe (1979), that the conditions
of Theorem A.2 are satisfied by the logarithmic score. We take $L = S$,
the logarithmic score (20) defined with respect to a measure $\mu$. This is
$\mathcal{M}$-strictly proper, where $\mathcal{M}$ is the set of all
probability distributions absolutely continuous with respect to $\mu$.

Proposition A.1. Conditions A.1 and 6.2 are satisfied for the logarithmic
score $S$ relative to a measure $\mu$ if either of the following holds:

(i) $\mu$ is a probability measure and $\mathcal{Q} = \mathcal{M}$;

(ii) $\mathcal{X}$ is countable, $\mu$ is counting measure and
$\mathcal{Q} = \{P \in \mathcal{P} : H(P) < \infty\}$.

Proof. To show Condition A.1, under either (i) or (ii), let $(Q_n)$ be
a sequence of distributions in $\Gamma$ such that $H(Q_n)$ converges to
$H^*$. Given $\varepsilon > 0$, choose $N$ such that, for $n > N$,
$H^* - H(Q_n) < \varepsilon$. Then for $n, m > N$, on applying (104) we have

$H^* \ge H\{\tfrac12 (Q_n + Q_m)\}$

(156)  $= \tfrac12 \left[ H(Q_n) + H(Q_m) + KL\{Q_n, \tfrac12 (Q_n + Q_m)\} + KL\{Q_m, \tfrac12 (Q_n + Q_m)\} \right]$

$\ge H^* - \varepsilon + \tfrac{1}{16} \|Q_n - Q_m\|^2$,

where $\|\cdot\|$ denotes total variation and the last inequality is an
application of Pinsker's inequality
$KL(P_1, P_2) \ge (1/4) \|P_1 - P_2\|^2$ [Pinsker (1964)]. That is,
$n, m \ge N \Rightarrow \|Q_n - Q_m\|^2 \le 16\varepsilon$, so that $(Q_n)$ is
a Cauchy sequence under $\|\cdot\|$. Since the total variation metric is
complete, $(Q_n)$ has a limit $Q$ in the uniform topology, which is then also
a weak limit [Stroock (1993), Section 3.1]. This shows Condition A.1.

To demonstrate Condition 6.2, suppose $Q_n \in \mathcal{Q}$,
$H(Q_n) \ge K > -\infty$ for all $n$, and $(Q_n)$ converges weakly to some
distribution $Q_0 \in \mathcal{P}_0$. By Posner (1975), Theorem 1,
$KL(P, Q)$ is jointly weakly lower semicontinuous in both arguments. In case
(i), the entropy $H(P) = -KL(P, \mu)$ is thus upper semicontinuous in
$P \in \mathcal{P}$, and it follows that $0 \ge H(Q_0) \ge K > -\infty$, which
implies $Q_0 \in \mathcal{M} = \mathcal{Q}$. In case (ii), the entropy
function is lower semicontinuous [Topsøe (2001)], whence
$0 \le H(Q_0) < \infty$, and again $Q_0 \in \mathcal{Q}$. In either case, the
lower semicontinuity of $KL(P, Q)$ in $Q$ then implies that, for
$P \in \mathcal{Q}$,
$S(P, Q_0) = KL(P, Q_0) + H(P) \le \liminf_{n \to \infty} \{KL(P, Q_n) + H(P)\} = \liminf_{n \to \infty} S(P, Q_n)$.

□
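The two ingredients of (156), the midpoint expansion of the entropy and the bound obtained from Pinsker's inequality, can be checked numerically on a finite sample space. The sketch below is only an illustration under stated assumptions: it uses Shannon entropy in nats, takes $\|\cdot\|$ to be the $L_1$ distance $\sum_x |p(x) - q(x)|$, and draws random distributions on four points with a fixed seed. The function names are hypothetical.

```python
import math
import random


def entropy(p):
    """Shannon entropy in nats, with 0 log 0 = 0."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def kl(p, q):
    """Kullback-Leibler divergence KL(p, q) in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def l1(p, q):
    """Total variation norm, taken here as the L1 distance sum |p - q|."""
    return sum(abs(pi - qi) for pi, qi in zip(p, q))


def random_dist(rng, k):
    """A random strictly positive probability vector of length k."""
    w = [rng.random() + 1e-3 for _ in range(k)]
    s = sum(w)
    return [wi / s for wi in w]


def check_156(q_n, q_m):
    """Return H(midpoint), its expansion as in (156), and the Pinsker bound."""
    mid = [0.5 * (a + b) for a, b in zip(q_n, q_m)]
    h_mid = entropy(mid)
    expansion = 0.5 * (entropy(q_n) + entropy(q_m) + kl(q_n, mid) + kl(q_m, mid))
    lower = 0.5 * (entropy(q_n) + entropy(q_m)) + l1(q_n, q_m) ** 2 / 16
    return h_mid, expansion, lower


rng = random.Random(0)
for _ in range(5):
    h_mid, expansion, lower = check_156(random_dist(rng, 4), random_dist(rng, 4))
    print(h_mid, expansion, lower)
```

In each row the first two numbers coincide up to rounding and the third does not exceed them, which is the step that turns near-maximal entropy of $Q_n$ and $Q_m$ into closeness in total variation.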

Theorem A.2 essentially extends the principal arguments and results of
Topsøe (1979) to nonlogarithmic loss functions. In such cases it might
sometimes be possible to establish the required conditions by methods similar
to Proposition A.1, but in general this could require new techniques.

A.4. Proof of Lemma 9.2. Suppose Condition 9.1 holds, and Conditions 
6.1 and 6.3 hold for L and T in Q. We note that H(P U ) is then bounded 
below by K and above by H* for u£Sl; for II G A, the integral in (123) and 
(124) is then bounded by the same quantities. 

To show Condition 6.1 holds for L and A in Q, let II n G A, with Bayes 
act C n G Z in Q, be such that (H(U n )) is bounded below and (II n ) converges 
weakly to IIo G A. Defining Q n := Pu n ,Qo '■= Pn > we then have Q n G T, 
with Bayes act £ n G Z in Q. Now let / :X -^TZ be bounded and continu- 
ous, and define g:@o^lZ by g(0) = Ep e {f(X)}. By the definition of weak 
convergence, the function g is continuous. It follows that Eg n {/(A)} = 
En„{<?(e)} -»• E no {<?(e)} = E Qo {f(X)}. This shows that (Q n ) converges 
weakly to Qo- Also, by (124) and Condition 9.1, the sequence (H(Q n )) 
is bounded below. It now follows from Condition 6.1 in Q T that Qo has 
a Bayes act £o in Q — any such act likewise being Bayes against IIo i n G- 
Also, for an appropriate choice of the Bayes acts (Cn) and Co, L(P,(o) < 
liminfn^oo L(P, £ n ), for all P G T. By finiteness of the integral in (123) we 
then obtain L(Tl, Co) < liminf n _ +00 L(n,C„), for all TIG A. 

We now show that Condition 6.3 holds for L and A in Q. First it is clear 
that A is convex. Since II G A and Pn £ T have the same Bayes acts (in 
their respective games), if Pu € T has a Bayes act, then so does II. Also, 
the integral in (123) is bounded as a function of II, whence H(T1) is finite if 
H(Pyi) is, and sup ngA //(II) is finite if sup Pgr H(P) is. 